AI Safety Atlas Ch.7 — Learning Dynamics

The training process guides which algorithms and goals are discovered. Outcomes are shaped by poorly understood preferences built into the training setup and by which solutions are easiest to compute.

Each parameter configuration corresponds to a different algorithm. Training searches algorithm space — the path determines which algorithms emerge.

Training often discovers algorithms that work for unexpected reasons:

  • “Thneeb” classification: models learn to identify objects by color rather than shape (a sketch of this shortcut follows the list)
  • 100 BERT models trained on identical data with identical hyperparameters (differing only in random seed) achieved nearly identical training performance yet adopted completely different reasoning approaches when tested on novel structures
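
Neither experiment’s code is given in the source; the following is a minimal sketch of the color-shortcut failure under assumptions (synthetic data, plain logistic regression, all names hypothetical). The true label is “shape,” encoded as the XOR of two pixels so a linear model cannot represent it, while “color” perfectly correlates with shape during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, color_matches_shape):
    # "shape" is the XOR of two pixels: invisible to a linear model
    px = rng.integers(0, 2, size=(n, 2))
    shape = px[:, 0] ^ px[:, 1]                       # the intended label
    color = shape if color_matches_shape else rng.integers(0, 2, size=n)
    X = np.column_stack([color, px]).astype(float)
    return X, shape.astype(float)

def train_logreg(X, y, lr=0.5, steps=2000):
    # plain logistic regression trained by gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_tr, y_tr = make_data(2000, color_matches_shape=True)   # color == shape in training
X_te, y_te = make_data(2000, color_matches_shape=False)  # correlation broken at test

w = train_logreg(X_tr, y_tr)
acc = lambda X, y: np.mean(((X @ w) > 0) == y)
print("train accuracy:", acc(X_tr, y_tr))  # ~1.0: the color shortcut suffices
print("test accuracy: ", acc(X_te, y_te))  # far lower: shape was never learned
```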

Loss Landscapes

A loss landscape visualizes how performance changes across parameter configurations. “A fixed mountain range with peaks, valleys, ridges, and basins.” Random initialization places a model at a random starting point; gradient descent acts like a ball rolling downhill.
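
A minimal sketch of the rolling-ball picture, using an assumed one-dimensional double-well loss (the function and step size are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy 1-D landscape: a double well with valleys at x = -2 and x = +2.
def grad(x):
    return x * (x**2 - 4.0) / 4.0   # gradient of (x**2 - 4)**2 / 16

# Random initialization drops the ball somewhere on the mountain range;
# gradient descent rolls it into whichever valley lies downhill.
x = rng.uniform(-4.0, 4.0)
for _ in range(300):
    x -= 0.1 * grad(x)
print(f"random start rolled down to x = {x:.2f}")
```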

Geometry determines which algorithms training can discover (see the sketch after this list):

  • Wide, flat valleys: robust, generalizable solutions that remain stable under perturbation
  • Sharp, narrow basins: brittle algorithms where tiny parameter changes cause major performance drops
  • If misaligned goals sit in wider valleys, they are more likely to emerge from training
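
The sketch referenced above probes this contrast numerically, assuming a toy one-dimensional loss with one wide and one sharp valley of equal depth; small random perturbations barely hurt the wide valley but sharply degrade the narrow one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy loss: two minima of equal depth, one wide (x = -3), one sharp (x = +3).
def loss(x):
    return np.minimum(0.1 * (x + 3.0) ** 2, 10.0 * (x - 3.0) ** 2)

eps = rng.normal(0.0, 0.3, size=10_000)   # small random parameter perturbations

for name, x_star in [("wide valley ", -3.0), ("sharp valley", 3.0)]:
    degradation = loss(x_star + eps).mean() - loss(x_star)
    print(f"{name}: mean loss increase under perturbation = {degradation:.3f}")
# The wide valley barely degrades; the sharp one degrades ~100x more:
# flat minima are robust, sharp minima are brittle.
```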

Path Dependence

Small training differences → fundamentally different algorithms.

  • High path dependence = high variance across runs
  • Low path dependence = consistent solutions

In complex landscapes with multiple deep valleys, starting position matters enormously. One starting position leads to “move right”; another discovers “collect coins” — completely different algorithms despite identical training performance.
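
A sketch of this seed sensitivity under assumptions, reusing the toy double-well loss: every run shares the landscape, optimizer, and step size, and only the random starting point varies, yet runs split between the two minima:

```python
import numpy as np

# Same assumed double-well landscape: two equally deep minima at x = -2 and
# x = +2, standing in for "move right" vs "collect coins".
def grad(x):
    return x * (x**2 - 4.0) / 4.0

def train(seed, steps=300, lr=0.1):
    x = np.random.default_rng(seed).uniform(-4.0, 4.0)  # only the init varies
    for _ in range(steps):
        x -= lr * grad(x)
    return x

solutions = [round(train(seed), 2) for seed in range(20)]
print(solutions)
# Every run reaches identical final loss, but the starting point alone
# decides which of the two "algorithms" a run ends up with.
```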

Inductive Bias

Inductive biases = “systematic preferences of learning algorithms that favor certain types of solutions over others”, emerging from architecture, optimization, and training setup rather than from explicit programming.

Simplicity Bias (Most Important for Goal Misgeneralization)

Gradient descent favors algorithms that rely on simple correlations over those doing complex causal reasoning. Simple functions occupy vastly larger volumes of parameter space:

  • More ways to encode “move right”
  • Fewer ways to encode “navigate to coin-shaped objects using visual recognition and spatial reasoning”

Empirical: “simple functions are exponentially more likely to emerge than complex ones.”
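
That finding echoes random-sampling experiments on small networks. A minimal sketch under assumptions (a tiny MLP with Gaussian weights over five boolean inputs; all choices illustrative): sample random parameters repeatedly, record which boolean function each draw computes, and observe that a few simple functions absorb most of the probability mass:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
n = 5   # 5 boolean inputs -> each sampled net computes one of 2**32 functions
inputs = np.array([[(i >> b) & 1 for b in range(n)] for i in range(2 ** n)], float)

def sample_function(hidden=16):
    # Draw a random tiny MLP; its thresholded outputs over all 32 inputs
    # form one boolean function (a truth table).
    W1 = rng.normal(size=(n, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    return tuple((np.tanh(inputs @ W1 + b1) @ W2 > 0).astype(int))

counts = Counter(sample_function() for _ in range(50_000))
print("distinct functions seen:", len(counts))
for table, freq in counts.most_common(5):
    print(f"frequency {freq:5d}   ones in truth table: {sum(table)}")
# A few trivially simple functions (all-zeros, all-ones, copies of one input)
# typically dominate the counts; complex truth tables appear only once each.
```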

Other Biases

  • Speed bias: favoring algorithms with faster inference
  • Frequency bias: learning low-frequency patterns before high-frequency ones (demonstrated in the sketch after this list)
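
A sketch of frequency bias under assumptions: a random-feature model (fixed tanh features with a trained linear readout, a simplification that still shows the effect) is fit to a target mixing one low and one high frequency, and the residual amplitude at each frequency is tracked during training:

```python
import numpy as np

rng = np.random.default_rng(4)
N, H = 256, 512
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * 1 * x) + np.sin(2 * np.pi * 16 * x)  # low + high frequency

# Fixed random tanh features with transitions spread across [0, 1];
# only the linear readout w is trained.
W1 = rng.normal(0, 30.0, size=(1, H))
b1 = -W1.ravel() * rng.uniform(0, 1, size=H)   # place each kink inside [0, 1]
feats = np.tanh(x[:, None] @ W1 + b1)          # (N, H)
w = np.zeros(H)

lr = 0.9 / np.linalg.eigvalsh(feats @ feats.T / N).max()  # stable step size
for step in range(20_001):
    err = feats @ w - y
    w -= lr * feats.T @ err / N
    if step % 4000 == 0:
        a_low = 2 * abs(np.mean(err * np.sin(2 * np.pi * 1 * x)))
        a_high = 2 * abs(np.mean(err * np.sin(2 * np.pi * 16 * x)))
        print(f"step {step:5d}: residual @f=1: {a_low:.2f}   @f=16: {a_high:.2f}")
# The f=1 residual collapses early while the f=16 residual lingers:
# low-frequency structure is fit first.
```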

Safety Implications

Inductive biases create goal misgeneralization risks because correlation-based algorithms are typically simpler than causal reasoning:

  • “Helpful responses get approval”: a simple correlational proxy
  • “Understand what humans actually need and provide that”: a complex causal model

As training environments grow complex, the gap between intended goals and easily-learned proxies expands → misaligned algorithms become increasingly likely.
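
A counting sketch makes the widening gap concrete, assuming random binary features stand in for candidate proxies and the training set is small: the expected number of proxies that perfectly match the intended labels in training grows linearly with the number of features:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 12                                  # size of the training set
labels = rng.integers(0, 2, size=m)     # the intended goal's labels on it

for n_features in [100, 1_000, 10_000, 100_000]:
    feats = rng.integers(0, 2, size=(n_features, m))   # candidate proxy features
    perfect = int(np.sum(np.all(feats == labels, axis=1)))
    print(f"{n_features:7d} candidate proxies -> {perfect} match training perfectly")
# Each random feature matches all m labels with probability 2**-m, so the
# expected count is n_features / 2**m: more complex environments supply
# more "wrong but simple" proxies that look correct in training.
```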

Connection to Wiki

This subchapter is the structural argument for why goal misgeneralization is a probable failure mode rather than a rare accident.