AI Safety Atlas Ch.7 — Learning Dynamics

The training process guides which algorithms and goals are discovered. Outcomes are shaped by poorly understood preferences built into the training setup and by which solutions are easiest to compute.

Each parameter configuration corresponds to a different algorithm. Training searches algorithm space — the path determines which algorithms emerge.

Training often discovers algorithms that work for unexpected reasons:

  • “Thneeb” classification: models learn to identify objects by color rather than shape (a sketch of this shortcut follows the list)
  • 100 BERT models trained on identical data with identical hyperparameters (differing only in random seed) achieved nearly identical training performance yet adopted completely different reasoning approaches when tested on novel structures
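
Neither experiment’s code is given in the source; the following is a minimal sketch of the color-shortcut failure under assumptions (synthetic data, plain logistic regression, all names hypothetical). The true label is “shape,” encoded as the XOR of two pixels so a linear model cannot represent it, while “color” perfectly correlates with shape during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, color_matches_shape):
    # "shape" is the XOR of two pixels: invisible to a linear model
    px = rng.integers(0, 2, size=(n, 2))
    shape = px[:, 0] ^ px[:, 1]                       # the intended label
    color = shape if color_matches_shape else rng.integers(0, 2, size=n)
    X = np.column_stack([color, px]).astype(float)
    return X, shape.astype(float)

def train_logreg(X, y, lr=0.5, steps=2000):
    # plain logistic regression trained by gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_tr, y_tr = make_data(2000, color_matches_shape=True)   # color == shape in training
X_te, y_te = make_data(2000, color_matches_shape=False)  # correlation broken at test

w = train_logreg(X_tr, y_tr)
acc = lambda X, y: np.mean(((X @ w) > 0) == y)
print("train accuracy:", acc(X_tr, y_tr))  # ~1.0: the color shortcut suffices
print("test accuracy: ", acc(X_te, y_te))  # far lower: shape was never learned
```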

Loss Landscapes

A loss landscape visualizes how performance changes across parameter configurations. “A fixed mountain range with peaks, valleys, ridges, and basins.” Random initialization places a model at a random starting point; gradient descent acts like a ball rolling downhill.
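
A minimal sketch of the rolling-ball picture, using an assumed one-dimensional double-well loss (the function and step size are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy 1-D landscape: a double well with valleys at x = -2 and x = +2.
def grad(x):
    return x * (x**2 - 4.0) / 4.0   # gradient of (x**2 - 4)**2 / 16

# Random initialization drops the ball somewhere on the mountain range;
# gradient descent rolls it into whichever valley lies downhill.
x = rng.uniform(-4.0, 4.0)
for _ in range(300):
    x -= 0.1 * grad(x)
print(f"random start rolled down to x = {x:.2f}")
```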

Geometry determines which algorithms training can discover (see the sketch after this list):

  • Wide, flat valleys: robust, generalizable solutions that remain stable under perturbation
  • Sharp, narrow basins: brittle algorithms where tiny parameter changes cause major performance drops
  • If misaligned goals sit in wider valleys, they are more likely to emerge from training
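
The sketch referenced above probes this contrast numerically, assuming a toy one-dimensional loss with one wide and one sharp valley of equal depth; small random perturbations barely hurt the wide valley but sharply degrade the narrow one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy loss: two minima of equal depth, one wide (x = -3), one sharp (x = +3).
def loss(x):
    return np.minimum(0.1 * (x + 3.0) ** 2, 10.0 * (x - 3.0) ** 2)

eps = rng.normal(0.0, 0.3, size=10_000)   # small random parameter perturbations

for name, x_star in [("wide valley ", -3.0), ("sharp valley", 3.0)]:
    degradation = loss(x_star + eps).mean() - loss(x_star)
    print(f"{name}: mean loss increase under perturbation = {degradation:.3f}")
# The wide valley barely degrades; the sharp one degrades ~100x more:
# flat minima are robust, sharp minima are brittle.
```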

Path Dependence

Small training differences → fundamentally different algorithms.

  • High path dependence = high variance across runs
  • Low path dependence = consistent solutions

In complex landscapes with multiple deep valleys, starting position matters enormously. One starting position leads to “move right”; another discovers “collect coins” — completely different algorithms despite identical training performance.
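
A sketch of this seed sensitivity under assumptions, reusing the toy double-well loss: every run shares the landscape, optimizer, and step size, and only the random starting point varies, yet runs split between the two minima:

```python
import numpy as np

# Same assumed double-well landscape: two equally deep minima at x = -2 and
# x = +2, standing in for "move right" vs "collect coins".
def grad(x):
    return x * (x**2 - 4.0) / 4.0

def train(seed, steps=300, lr=0.1):
    x = np.random.default_rng(seed).uniform(-4.0, 4.0)  # only the init varies
    for _ in range(steps):
        x -= lr * grad(x)
    return x

solutions = [round(train(seed), 2) for seed in range(20)]
print(solutions)
# Every run reaches identical final loss, but the starting point alone
# decides which of the two "algorithms" a run ends up with.
```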

Inductive Bias

Inductive biases = “systematic preferences of learning algorithms that favor certain types of solutions over others”, emerging from architecture, optimization, and training setup rather than from explicit programming.

Simplicity Bias (Most Important for Goal Misgeneralization)

Gradient descent favors algorithms that rely on simple correlations over those doing complex causal reasoning. Simple functions occupy vastly larger volumes of parameter space:

  • More ways to encode “move right”
  • Fewer ways to encode “navigate to coin-shaped objects using visual recognition and spatial reasoning”

Empirical: “simple functions are exponentially more likely to emerge than complex ones.”
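
That finding echoes random-sampling experiments on small networks. A minimal sketch under assumptions (a tiny MLP with Gaussian weights over five boolean inputs; all choices illustrative): sample random parameters repeatedly, record which boolean function each draw computes, and observe that a few simple functions absorb most of the probability mass:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
n = 5   # 5 boolean inputs -> each sampled net computes one of 2**32 functions
inputs = np.array([[(i >> b) & 1 for b in range(n)] for i in range(2 ** n)], float)

def sample_function(hidden=16):
    # Draw a random tiny MLP; its thresholded outputs over all 32 inputs
    # form one boolean function (a truth table).
    W1 = rng.normal(size=(n, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    return tuple((np.tanh(inputs @ W1 + b1) @ W2 > 0).astype(int))

counts = Counter(sample_function() for _ in range(50_000))
print("distinct functions seen:", len(counts))
for table, freq in counts.most_common(5):
    print(f"frequency {freq:5d}   ones in truth table: {sum(table)}")
# A few trivially simple functions (all-zeros, all-ones, copies of one input)
# typically dominate the counts; complex truth tables appear only once each.
```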

Other Biases

  • Speed bias: favoring algorithms with faster inference
  • Frequency bias: learning low-frequency patterns before high-frequency ones (demonstrated in the sketch after this list)
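
A sketch of frequency bias under assumptions: a random-feature model (fixed tanh features with a trained linear readout, a simplification that still shows the effect) is fit to a target mixing one low and one high frequency, and the residual amplitude at each frequency is tracked during training:

```python
import numpy as np

rng = np.random.default_rng(4)
N, H = 256, 512
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * 1 * x) + np.sin(2 * np.pi * 16 * x)  # low + high frequency

# Fixed random tanh features with transitions spread across [0, 1];
# only the linear readout w is trained.
W1 = rng.normal(0, 30.0, size=(1, H))
b1 = -W1.ravel() * rng.uniform(0, 1, size=H)   # place each kink inside [0, 1]
feats = np.tanh(x[:, None] @ W1 + b1)          # (N, H)
w = np.zeros(H)

lr = 0.9 / np.linalg.eigvalsh(feats @ feats.T / N).max()  # stable step size
for step in range(20_001):
    err = feats @ w - y
    w -= lr * feats.T @ err / N
    if step % 4000 == 0:
        a_low = 2 * abs(np.mean(err * np.sin(2 * np.pi * 1 * x)))
        a_high = 2 * abs(np.mean(err * np.sin(2 * np.pi * 16 * x)))
        print(f"step {step:5d}: residual @f=1: {a_low:.2f}   @f=16: {a_high:.2f}")
# The f=1 residual collapses early while the f=16 residual lingers:
# low-frequency structure is fit first.
```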

Safety Implications

Inductive biases create goal misgeneralization risks because correlation-based algorithms are typically simpler than causal reasoning:

  • “Helpful responses get approval”: a simple correlational proxy
  • “Understand what humans actually need and provide that”: a complex causal model

As training environments grow complex, the gap between intended goals and easily-learned proxies expands → misaligned algorithms become increasingly likely.
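
A counting sketch makes the widening gap concrete, assuming random binary features stand in for candidate proxies and the training set is small: the expected number of proxies that perfectly match the intended labels in training grows linearly with the number of features:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 12                                  # size of the training set
labels = rng.integers(0, 2, size=m)     # the intended goal's labels on it

for n_features in [100, 1_000, 10_000, 100_000]:
    feats = rng.integers(0, 2, size=(n_features, m))   # candidate proxy features
    perfect = int(np.sum(np.all(feats == labels, axis=1)))
    print(f"{n_features:7d} candidate proxies -> {perfect} match training perfectly")
# Each random feature matches all m labels with probability 2**-m, so the
# expected count is n_features / 2**m: more complex environments supply
# more "wrong but simple" proxies that look correct in training.
```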

Connection to Wiki

This subchapter is the structural argument for why goal misgeneralization is a probable failure mode rather than a rare accident.