Learning Dynamics
Learning dynamics is the study of how the training process shapes which algorithms emerge in neural networks. Training is not neutral optimization: it has poorly understood preferences that make some algorithms (especially simple proxy-based ones) systematically more likely to emerge than others.
The AI Safety Atlas (Ch.7.1) treats learning dynamics as the structural foundation for goal misgeneralization, explaining why that failure mode is mathematically probable.
ML as Algorithm Search
Every parameter configuration implements some algorithm, so training is a search through algorithm space; the path the search takes determines which algorithm emerges.
Concrete demonstration: 100 BERT models trained on identical data with identical hyperparameters reached nearly identical training performance, yet used markedly different reasoning strategies when tested on novel structures. Same setup, divergent algorithms.
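A minimal way to reproduce the flavor of this result at toy scale (an illustrative sketch using scikit-learn, not the BERT study itself): train several identically configured classifiers on data where two features are redundant, then probe them on inputs where the features conflict.

```python
# Illustrative sketch, not the BERT experiment: identical setups can still
# internalize different rules. The two training features always agree, so
# either one suffices; probes where they conflict reveal which feature each
# run actually relies on.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
X_train = np.column_stack([y, y]) + 0.1 * rng.standard_normal((n, 2))

# Off-distribution probes where feature 0 and feature 1 disagree.
X_probe = np.array([[1.0, 0.0], [0.0, 1.0]])

for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=seed)
    clf.fit(X_train, y)
    # Same data, same hyperparameters; only the random seed differs.
    print(seed, round(clf.score(X_train, y), 3), clf.predict(X_probe))
```

Training accuracy is essentially identical across seeds; whether the probe predictions diverge depends on the task, but any disagreement means the runs learned different rules from the same setup.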
Loss Landscapes
Visualize the network’s loss across parameter configurations as a topographic map: “a fixed mountain range with peaks, valleys, ridges, and basins.”
- Wide flat valleys — robust, generalizable solutions that remain good under perturbation
- Sharp narrow valleys — brittle algorithms; tiny parameter changes cause major performance drops (a crude flatness probe follows this subsection)
Critical for safety: if the valleys corresponding to misaligned goals are wider than those for aligned goals, training is more likely to land in them.
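One crude way to operationalize the flat-versus-sharp distinction (an illustrative sketch, not a method from the Atlas): measure how much a toy loss rises under small random parameter perturbations around each solution.

```python
# Toy 1-D loss with a wide basin near w = -2 and a narrow basin near w = +2.
# The average loss increase under Gaussian perturbation is a crude sharpness
# proxy: a small increase means flat/robust, a large increase means brittle.
import numpy as np

def loss(w):
    return np.minimum(0.05 * (w + 2) ** 2, 5.0 * (w - 2) ** 2)

rng = np.random.default_rng(0)

def sharpness(w_star, sigma=0.3, samples=10_000):
    noise = rng.normal(0.0, sigma, samples)
    return float(np.mean(loss(w_star + noise) - loss(w_star)))

print("wide basin   (w = -2):", sharpness(-2.0))  # loss barely moves
print("narrow basin (w = +2):", sharpness(+2.0))  # loss jumps sharply
```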
Path Dependence
Small training differences → fundamentally different algorithms.
- High path dependence = high variance across runs
- Low path dependence = consistent solutions
In complex landscapes with multiple deep valleys, starting position matters enormously. Identical setups can produce dramatically different behaviors.
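A minimal illustration of how starting position picks the valley (a toy construction, not from the Atlas): plain gradient descent on a double-well loss ends up in whichever basin the initialization happens to sit in.

```python
# Gradient descent on L(w) = (w^2 - 1)^2, which has two equally good minima
# at w = -1 and w = +1. The loss is identical across runs; only the starting
# point decides which solution emerges.
def grad(w):
    return 4 * w * (w ** 2 - 1)

def run_gd(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

for w0 in (-1.5, -0.1, 0.1, 1.5):
    print(f"init {w0:+.1f} -> converges to {run_gd(w0):+.3f}")
```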
Inductive Bias
Systematic preferences of learning algorithms favoring certain solution types — emerging from architecture, optimization procedures, training setup, not explicit programming.
Simplicity Bias (Most Important for Safety)
Gradient descent favors algorithms relying on simple correlations over complex causal reasoning.
Why: simple functions occupy vastly larger volumes of parameter space, i.e. wider regions of the loss landscape. Empirically, “simple functions are exponentially more likely to emerge than complex ones.”
Image classifiers learn texture-based strategies over shape-based ones because texture patterns require simpler computational structures. The same logic applies to alignment (a toy sketch follows this list):
- “Helpful responses get approval” — algorithmically simple
- “Understand what humans actually need and provide that” — complex causal model
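A toy version of this proxy-versus-causal-model point (my own construction; the XOR task and the proxy feature are invented for illustration): the intended rule needs nonlinear structure, while a shortcut feature simply copies the label during training and breaks at test time.

```python
# Shortcut/simplicity-bias toy: the label is XOR(a, b), but a third feature c
# equals the label during training (a simple proxy) and is random noise at test.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_data(n, proxy_valid):
    a = rng.integers(0, 2, n)
    b = rng.integers(0, 2, n)
    y = a ^ b                                         # the intended (causal) rule
    c = y if proxy_valid else rng.integers(0, 2, n)   # the simple proxy feature
    return np.column_stack([a, b, c]).astype(float), y

X_train, y_train = make_data(5000, proxy_valid=True)
X_test, y_test = make_data(5000, proxy_valid=False)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)

# If gradient descent latches onto the proxy, training accuracy stays near 1.0
# while test accuracy (proxy broken) collapses toward chance.
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy :", clf.score(X_test, y_test))
```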
Other Biases
- Speed bias — favoring faster inference
- Frequency bias — learning low-frequency patterns first (a rough probe follows this list)
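A rough probe of the frequency bias, read in the spectral sense (illustrative; the two-frequency target and the short training budget are my assumptions): fit a briefly trained network to a low-plus-high-frequency signal and check how much of each component the prediction captures.

```python
# Spectral-bias-style probe: fit y = sin(2*pi*x) + sin(2*pi*16*x) with a short
# training budget, then project the prediction onto each frequency component.
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(0.0, 1.0, 512).reshape(-1, 1)
low = np.sin(2 * np.pi * 1 * x[:, 0])
high = np.sin(2 * np.pi * 16 * x[:, 0])
y = low + high

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=200, random_state=0)
net.fit(x, y)
pred = net.predict(x)

def captured(component):
    # Least-squares coefficient of the prediction on one frequency component.
    return float(pred @ component / (component @ component))

# Early in training the low-frequency coefficient is typically much closer to 1
# than the high-frequency one; training longer tends to close the gap.
print("low-frequency coefficient :", captured(low))
print("high-frequency coefficient:", captured(high))
```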
Singular Learning Theory
Classical learning theory assumes simpler models generalize better, with parameter count as the measure of complexity. This breaks down for overparametrized neural networks, where many different parameter configurations implement the same algorithm.
Singular Learning Theory’s insight: the algorithms gradient descent discovers reflect which solutions were already probable under random sampling of parameters. Its free energy formula (sketched after this list) formalizes why:
- Simpler algorithms with fewer “effective parameters” systematically emerge during finite training
- Even when more complex algorithms better capture intended goals
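For reference, a common simplified statement of that free energy result (a standard asymptotic form from Singular Learning Theory, paraphrased here rather than quoted from the Atlas), where the learning coefficient λ plays the role of an effective parameter count:

```latex
% Simplified free energy asymptotics (standard SLT form). Assumptions: n samples,
% empirical loss L_n, and learning coefficient lambda of the region around the
% solution w*.
F_n \;\approx\; n\,L_n(w^*) \;+\; \lambda \log n
```

Because the penalty grows with λ, a region containing a simpler solution (smaller λ) can win this tradeoff at finite n even if its loss L_n(w*) is slightly worse; for regular models λ equals half the parameter count, while for singular networks it can be far smaller.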
Per the Atlas, this tradeoff makes goal misgeneralization mathematically probable rather than merely possible.
Goal Crystallization
Phase transitions where networks abandon simple goal approximations for more complex ones. Monitoring phase transitions via the Local Learning Coefficient could provide early warning when systems shift toward potentially misaligned goals.
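To make the monitoring idea concrete, developmental-interpretability work estimates the Local Learning Coefficient at a checkpoint w* by sampling near it (my summary of the published sampling-based estimator; the inverse-temperature choice β = 1/log n is one common convention, not taken from the Atlas):

```latex
% Sampling-based LLC estimate at a checkpoint w*: average the training loss over
% samples drawn near w* (for example via SGLD) at inverse temperature beta,
% subtract the loss at w*, and scale by n*beta.
\hat{\lambda}(w^*) \;=\; n\beta \Big( \mathbb{E}^{\beta}_{w \mid w^*}\big[L_n(w)\big] \;-\; L_n(w^*) \Big)
```

Tracking this estimate across training checkpoints and flagging sudden changes is the proposed early-warning signal for the phase transitions described above.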
Why Learning Dynamics Make Alignment Hard
Learning dynamics gives the structural argument for why simply specifying the right reward isn’t enough:
- Simplicity bias — simple proxies more likely than complex causal reasoning
- Path dependence — same training can yield very different algorithms
- Loss landscape geometry — misaligned-goal valleys can be wider than aligned ones
- Singular learning theory — proxy goals occupy larger probability mass
These structural features make goal-misgeneralization expected rather than surprising.
Connection to Wiki
- goal-misgeneralization — the failure mode learning dynamics enables
- outer-vs-inner-alignment — Ch.7’s inner-alignment frame
- scaling-laws — adjacent ML dynamics
- interpretability — what would let us see the algorithms emerging
- learning-dynamics-and-developmental-interpretability — SR2025 agenda for studying these dynamics
- ai-safety-atlas-textbook — primary source
Related Pages
- goal-misgeneralization
- mesa-optimization
- goal-directedness
- outer-vs-inner-alignment
- scaling-laws
- interpretability
- learning-dynamics-and-developmental-interpretability
- ai-safety-atlas-textbook
- atlas-ch7-goal-misgeneralization-01-learning-dynamics
- atlas-ch7-goal-misgeneralization-06-mitigations
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.7 — Learning Dynamics — referenced as [[atlas-ch7-goal-misgeneralization-01-learning-dynamics]]
- AI Safety Atlas Ch.7 — Mitigations — referenced as [[atlas-ch7-goal-misgeneralization-06-mitigations]]