AI Safety Atlas Ch.7 — Multi-Objective Generalization

Source: Multi-Objective Generalization

The mechanism behind goal misgeneralization: machine learning systems can learn correlated proxy objectives instead of intended goals despite correct training signals. Failures often invisible until deployment. See multi-objective-generalization.

The CoinRun Example (Canonical Demonstration)

CoinRun = procedurally-generated platformer where agents collect coins for rewards.

Training setup: coins consistently appear rightward. Result: two perfectly correlated patterns:

  • Intended goal: “collect coins”
  • Proxy goal: “move right”

When tested with randomly placed coins, agents continued moving right toward empty walls, ignoring visible coins.

“Agents learned ‘move right’ rather than ‘collect coins’ despite receiving correct reward signals.”

Goals Are Not Rewards

The distinction is fundamental. Training signals act as selection pressures favoring behavioral patterns — similar to evolutionary fitness pressures. The optimization process has no inherent preference for intended interpretations. Any pattern correlating with high reward becomes a candidate goal.

This differs from specification problems (specification-gaming):

  • Spec gaming: behavior diverges from training (loophole exploitation)
  • Goal misgeneralization: behavior is identical during training — both intended and proxy goals produce perfect performance

Detection is impossible until deployment.

Multi-Dimensional Generalization

Different objectives generalize independently. A system might:

  • Demonstrate strong capability generalization (performs well in new contexts)
  • While failing at goal generalization (pursuing different objectives)

“Capabilities follow simple, universal patterns… But there’s no universal force that pulls all systems toward the same goals.”

This asymmetry creates a core safety problem: increased capability doesn’t automatically improve alignment.

Causal Correlations and Distribution Shift

Learning algorithms develop causal models. In CoinRun’s training:

  • True causality: coin collection → reward
  • Spurious causality: rightward movement → reward

Both explain the data equally well during training.

Goal misgeneralization becomes visible only when deployment breaks training correlations. This is inevitable: “training environments cannot perfectly replicate all possible deployment conditions.”

Auto-Induced Distribution Shift

Amplifies the problem: when systems’ own actions change their data environment, creating feedback loops reinforcing misgeneralized goals.

A content recommendation system optimizing for engagement might shift user behavior toward sensational content, reinforcing its learned objective while drifting from the intended goal. This connects to epistemic-erosion — recommendation systems as systemic-risk actors.

The Orthogonality Thesis Empirically Supported

Goal misgeneralization is empirical evidence for the orthogonality thesis (Bostrom’s claim that intelligence and goals vary independently):

Systems retain sophisticated capabilities while pursuing unintended objectives. Capability improvements alone don’t ensure alignment.

This is the empirical foundation for the strategic arguments behind differential-development and ai-control — capability without alignment isn’t safe.

Connection to Wiki