Learning Dynamics

Learning dynamics covers how the training process shapes which algorithms emerge in neural networks. Far from being a neutral optimization process, training embodies poorly understood preferences that make some algorithms (especially simple proxy-based ones) systematically more likely to emerge than others.

The AI Safety Atlas (Ch. 7.1) treats learning dynamics as the structural foundation of goal misgeneralization, explaining why goal misgeneralization is mathematically probable rather than merely possible.

Each parameter configuration implements a particular algorithm, so training is a search through algorithm space, and the path taken determines which algorithm emerges.

Concrete demonstration: 100 BERT models trained on identical data with identical hyperparameters (differing only in random seed) achieved nearly identical training performance, yet revealed completely different reasoning approaches when tested on novel sentence structures. Same setup, divergent algorithms.

Loss Landscapes

Visualize a neural network’s loss across parameter configurations as a topographic map: “a fixed mountain range with peaks, valleys, ridges, and basins.” Training descends into valleys (low loss), and the shape of the valley matters:

  • Wide, flat valleys — robust, generalizable solutions that remain stable under parameter perturbation
  • Narrow, sharp valleys — brittle algorithms; tiny parameter changes cause major performance drops

Critical for safety: if the valleys implementing misaligned proxy goals are wider than those implementing aligned goals, training is more likely to settle into them.
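The flat/sharp distinction can be probed numerically: perturb the parameters at a minimum and measure how fast the loss rises. A minimal sketch with a toy 1-D loss containing one flat and one sharp basin (the loss function and constants here are illustrative assumptions, not from the source):

```python
import numpy as np

def loss(w):
    # Toy 1-D loss with two minima of equal depth (zero):
    # a flat basin at w = -2 and a sharp basin at w = +2.
    return min(0.1 * (w + 2) ** 2, 10.0 * (w - 2) ** 2)

def sharpness(w_star, eps=0.1, n=100):
    # Average loss increase under random perturbations of size <= eps.
    rng = np.random.default_rng(0)
    deltas = rng.uniform(-eps, eps, size=n)
    return float(np.mean([loss(w_star + d) - loss(w_star) for d in deltas]))

flat_rise = sharpness(-2.0)   # small rise: wide valley, robust solution
sharp_rise = sharpness(2.0)   # large rise: narrow valley, brittle solution
print(flat_rise, sharp_rise)
```

The wide basin also occupies far more of parameter space, which is why volume arguments favor the solutions living in it.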

Path Dependence

Small training differences → fundamentally different algorithms.

  • High path dependence = high variance across runs
  • Low path dependence = consistent solutions

In complex landscapes with multiple deep valleys, starting position matters enormously. Identical setups can produce dramatically different behaviors.
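Path dependence is easy to reproduce on a toy task. The sketch below (an illustrative setup, not from the source) trains two small tanh networks on XOR, identical in everything except the random seed: both fit the data, yet they end up at far-apart parameter configurations, i.e. in different valleys.

```python
import numpy as np

# XOR: a task whose loss landscape has multiple distinct valleys.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

def train(seed, steps=10000, lr=0.3, hidden=8):
    # One-hidden-layer tanh net, full-batch gradient descent.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 1, (2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        err = (h @ W2 + b2 - Y) / len(X)   # mean-squared-error residual
        dh = (err @ W2.T) * (1 - h ** 2)   # backprop through tanh
        W2 -= lr * (h.T @ err); b2 -= lr * err.sum(0)
        W1 -= lr * (X.T @ dh);  b1 -= lr * dh.sum(0)
    loss = float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2))
    return loss, np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

loss_a, params_a = train(seed=0)
loss_b, params_b = train(seed=1)
# Similar training loss, but far-apart parameter configurations.
print(loss_a, loss_b, float(np.linalg.norm(params_a - params_b)))
```

The starting position (the seed) decides which valley the run descends into, even though nothing about the task or optimizer changed.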

Inductive Bias

Inductive biases are the systematic preferences of a learning algorithm for certain solution types. They emerge from the architecture, the optimization procedure, and the training setup, not from explicit programming.

Simplicity Bias (Most Important for Safety)

Gradient descent favors algorithms relying on simple correlations over complex causal reasoning.

Why: simple functions occupy vastly larger volumes of parameter space. Empirically, “simple functions are exponentially more likely to emerge than complex ones.”

Image classifiers learn texture-based strategies over shape-based ones because texture patterns require simpler computational structures. The same logic applies to alignment:

  • “Helpful responses get approval” — algorithmically simple
  • “Understand what humans actually need and provide that” — complex causal model
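The volume argument behind simplicity bias can be illustrated by randomly sampling the parameters of a small network and counting which input-output functions appear (in the spirit of the cited empirical result; the architecture, sample count, and thresholding here are illustrative assumptions):

```python
import numpy as np
from collections import Counter

# All 8 binary inputs of length 3; a "function" is the 8-bit output pattern.
inputs = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)

def random_function(rng, hidden=5):
    # Sample a small tanh MLP with Gaussian weights and threshold its
    # output, yielding one Boolean function of 3 bits.
    W1 = rng.normal(size=(3, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=(hidden, 1)); b2 = rng.normal(size=1)
    out = np.tanh(inputs @ W1 + b1) @ W2 + b2
    return tuple((out.ravel() > 0).astype(int))

rng = np.random.default_rng(0)
counts = Counter(random_function(rng) for _ in range(5000))

uniform = 5000 / 256   # expected count if all 256 functions were equally likely
top_fn, top_count = counts.most_common(1)[0]
print(top_fn, top_count, uniform)
```

The distribution is heavily skewed: a few simple functions dominate the samples, while complex ones (such as parity) are rarely or never hit, which is exactly the parameter-space volume asymmetry the text describes.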

Other Biases

  • Speed bias — favoring faster inference
  • Frequency bias — learning low-frequency patterns first
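Frequency bias (often called spectral bias) can be shown without a full neural network: gradient descent on a kernel model exhibits the same effect, because smooth, low-frequency directions get the largest kernel eigenvalues and are fit first. A minimal deterministic sketch, assuming an RBF kernel and an illustrative lengthscale:

```python
import numpy as np

# Target signal: equal-amplitude low (f=1) and high (f=10) frequency parts.
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + np.sin(2 * np.pi * 10 * x)

# RBF kernel Gram matrix; smooth directions have the largest eigenvalues.
ell = 0.1
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * ell ** 2))

lr = 0.5 / np.linalg.eigvalsh(K).max()   # stable step size
pred = np.zeros_like(y)
for _ in range(200):
    pred -= lr * (K @ (pred - y))        # functional gradient descent step

# Amplitude recovered at each frequency (both are 1.0 in the target).
a_low = 2 * float(np.mean(pred * np.sin(2 * np.pi * x)))
a_high = 2 * float(np.mean(pred * np.sin(2 * np.pi * 10 * x)))
print(a_low, a_high)
```

After a fixed training budget the low-frequency component is essentially recovered while the high-frequency component is barely touched; neural-tangent-kernel analyses attribute the same behavior to wide neural networks.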

Singular Learning Theory

Classical learning theory assumes simpler models generalize better, using parameter count as the measure of complexity. This breaks down for overparametrized neural networks, where many distinct parameter configurations implement identical algorithms.

Singular Learning Theory reveals that gradient descent tends to discover the algorithms that were already probable under random sampling of parameters. Its free energy formula formalizes why:

  • Simpler algorithms with fewer “effective parameters” systematically emerge during finite training
  • This holds even when more complex algorithms would better capture the intended goals
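For reference, the free energy formula in its standard SLT statement (Watanabe's asymptotic expansion, quoted from standard SLT references rather than from this page; here $L_n$ is the empirical loss at the optimal parameter $w_0$, $\lambda$ the learning coefficient, and $m$ its multiplicity):

```latex
F_n \approx n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)
```

Because $\lambda$ acts as an effective parameter count, a simpler solution (smaller $\lambda$) has lower free energy, and a more complex solution only wins once its loss advantage $n \, \Delta L$ outgrows its complexity penalty $\Delta\lambda \log n$.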

Per the Atlas: this makes goal misgeneralization mathematically inevitable rather than merely possible.

Goal Crystallization

Goal crystallization refers to phase transitions in which a network abandons a simple approximation of its goal for a more complex one. Monitoring such transitions via the Local Learning Coefficient could provide early warning when a system shifts toward potentially misaligned goals.

Why Learning Dynamics Make Alignment Hard

Learning dynamics give the structural argument for why simply specifying the right reward isn’t enough:

  • Simplicity bias — simple proxies more likely than complex causal reasoning
  • Path dependence — same training can yield very different algorithms
  • Loss landscape geometry — misaligned-goal valleys can be wider than aligned ones
  • Singular learning theory — proxy goals occupy larger probability mass

These structural features make goal-misgeneralization expected rather than surprising.
