AI Safety Atlas Ch.7 — Mitigations

Mitigation strategies for goal misgeneralization, organized by lifecycle stage: training-time, post-training, and deployment-time.

A Critical Constraint

Maintain strict separation between interpretability signals used for safety verification and those incorporated into training objectives. If models learn to evade detection methods used during training, those same methods become less effective for subsequent verification.

This is “the most forbidden technique” from Ch.3.4 (cot-monitoring-technique), generalized to all interpretability: use the method to analyze the model, but train only on its outputs.

Training-Time Interventions

Mechanistic Training Stories

Don’t just specify behaviors — articulate which internal algorithms should emerge and why the training setup would produce them. Two components:

  1. Training goal — desired algorithmic approach
  2. Training rationale — why the setup produces this algorithm over proxies

Curriculum Learning

Shape the loss landscape by controlling how task complexity progresses. Causally aligned curriculum learning keeps the optimal decision rule invariant across stages, so proxy goals that succeed early do not remain optimal as complexity increases.

Interpretability-guided curriculum design monitors emerging representations in real time and adjusts subsequent stages based on detected patterns, as sketched below.
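
A minimal sketch of that loop, assuming a hypothetical `probe_proxy_score` that measures how strongly a known spurious feature is linearly decodable from the model's activations (all names here are illustrative, not from the source):

```python
def run_curriculum(model, stages, train_on, probe_proxy_score, threshold=0.2):
    """Advance through complexity-ordered stages, pausing to break proxy
    correlations whenever a probe suggests a spurious feature is forming.

    stages            -- datasets ordered from simple to complex
    train_on          -- callable(model, data): one stage of training
    probe_proxy_score -- callable(model, data) -> how linearly decodable a
                         known spurious feature is (0 = absent); hypothetical
    """
    for stage_data in stages:
        train_on(model, stage_data)
        if probe_proxy_score(model, stage_data) > threshold:
            # Representation drifted toward the proxy: retrain this stage on
            # correlation-broken data before increasing task complexity.
            train_on(model, stage_data.break_proxy_correlation())
    return model
```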

Data Augmentation

Systematically vary potentially spurious features while holding goal-relevant features constant. Synthetic data and procedural environment generation extend correlation-breaking to scale. The fundamental challenge: anticipating which correlations to break without advance knowledge of the deployment distribution.
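
A minimal sketch of correlation-breaking on image-like data, assuming the spurious feature is a background tint (the background heuristic and names are illustrative, not from the source):

```python
import torch

def break_spurious_correlation(images, labels, num_variants=4):
    """Replicate each example under randomized background tints so the
    spurious feature (background color) no longer predicts the label,
    while the goal-relevant foreground is left untouched.

    images -- tensor of shape (N, 3, H, W); labels -- list of ints
    """
    augmented, aug_labels = [], []
    for img, y in zip(images, labels):
        # Treat near-black pixels as "background" (illustrative heuristic).
        mask = (img.sum(dim=0, keepdim=True) < 0.1).float()
        for _ in range(num_variants):
            tint = torch.rand(3, 1, 1)  # random per-channel background color
            augmented.append(img * (1 - mask) + tint * mask)
            aug_labels.append(y)
    return torch.stack(augmented), torch.tensor(aug_labels)
```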

Latent Adversarial Training (LAT)

Tests robustness by perturbing internal representations rather than external inputs, directly targeting the computational mechanisms that enable goal misgeneralization. Targeted LAT improves refusal robustness at reduced compute cost.
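
A minimal single-step sketch of untargeted LAT, assuming the model splits into an encoder and a head so the latent can be perturbed directly (the split and the perturbation rule are illustrative):

```python
import torch

def lat_step(encoder, head, loss_fn, x, y, eps=0.1, opt=None):
    """One latent adversarial training step: find a small perturbation of the
    internal representation that maximizes loss, then train the model to
    behave correctly even under that perturbed representation."""
    # Inner step: ascend the loss with respect to a latent perturbation.
    z = encoder(x)
    delta = torch.zeros_like(z, requires_grad=True)
    adv_loss = loss_fn(head(z + delta), y)
    grad, = torch.autograd.grad(adv_loss, delta)
    delta = eps * grad / (grad.norm() + 1e-8)  # fixed-size worst-case step

    # Outer step: descend the task loss under the perturbed latent.
    loss = loss_fn(head(encoder(x) + delta), y)
    if opt is not None:
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```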

Myopic Training

Constrain systems to optimize only immediate objectives, preventing the development of beyond-episode goals that enable strategic deception. Systems can’t engage in alignment faking when their planning horizon doesn’t extend beyond individual responses.

This is a structural counter to scheming’s fourth prerequisite (beyond-episode goals).
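
In reinforcement-learning terms, myopia is the γ = 0 special case of the discounted objective (a standard formulation, stated here for concreteness):

```latex
J_\gamma(\pi) \;=\; \mathbb{E}_\pi\Big[\textstyle\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k}\Big],
\qquad
J_{\text{myopic}}(\pi) \;=\; J_{0}(\pi) \;=\; \mathbb{E}_\pi\big[r_t\big].
```

With γ = 0, rewards beyond the current step never enter the objective, so gradients cannot select for plans that only pay off later.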

Process-Oriented Training

Optimize reasoning steps rather than just final outputs, forcing systems to use appropriate causal reasoning instead of shortcuts (a step-scoring sketch follows the list). Includes:

  • AI-assisted training where separate models analyze behaviors
  • Relaxed adversarial training that makes transparency instrumentally valuable
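
A minimal sketch of step-level scoring, assuming a hypothetical `step_scorer` model that rates individual reasoning steps; only the aggregation is shown:

```python
def process_reward(steps, step_scorer):
    """Score a chain of reasoning step by step rather than only the final
    answer, so flawed intermediate reasoning is penalized even when the
    final output happens to be correct."""
    if not steps:
        return 0.0
    return sum(step_scorer(s) for s in steps) / len(steps)
```

A trainer would then maximize `process_reward(trajectory.steps, scorer)` instead of an outcome-only reward on the final answer (names illustrative).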

Interpretability-Guided Training

Real-time analysis of model internals using sparse autoencoders (SAEs), steering vectors, and probes. When problematic patterns emerge, training can introduce corrective measures.

Concept Ablation Fine-Tuning (CAFT) identifies problematic activation directions and trains while projecting them out. In experiments, CAFT reduced harmful responses by 10× while maintaining task performance. It works best when misaligned goals correspond to identifiable linear directions.
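
A minimal sketch of the projection at the heart of concept ablation, assuming the problematic direction has already been identified (e.g., via SAEs or contrastive activations); this illustrates the idea, not the paper's implementation:

```python
import torch

def ablate_direction(acts, direction):
    """Project activations onto the orthogonal complement of a unit-norm
    'problematic' direction, removing that concept's contribution."""
    d = direction / direction.norm()
    return acts - (acts @ d).unsqueeze(-1) * d

class AblationHook:
    """Forward hook that ablates a concept direction during fine-tuning, so
    gradients never reinforce computation along that direction."""
    def __init__(self, direction):
        self.direction = direction
    def __call__(self, module, inputs, output):
        return ablate_direction(output, self.direction)

# Usage (illustrative): register on a layer whose forward returns a tensor,
# fine-tune as usual, then remove the hook before evaluation:
#   handle = model.layers[k].register_forward_hook(AblationHook(d))
#   ...train...
#   handle.remove()
```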

Post-Training Interventions

Steering Vectors

Linear directions in activation space corresponding to specific traits or objectives. They enable both detection (measuring an activation’s projection onto the direction) and control (adding the scaled direction during inference).
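
A minimal sketch of the standard difference-of-means construction and both uses, assuming cached residual-stream activations:

```python
import torch

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations over contrastive prompt sets
    (e.g., honest vs. deceptive completions); shape (d_model,)."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def detect(act, v):
    """Detection: projection of an activation onto the unit direction."""
    return torch.dot(act, v) / v.norm()

def steer(act, v, alpha=1.0):
    """Control: add the scaled direction to an activation at inference."""
    return act + alpha * v
```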

Model Editing and Unlearning

Surgical removal of goal-related representations without full retraining; techniques include Rank-One Model Editing and concept erasure. Limitation: editing cannot prevent models from redeveloping similar patterns.
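
A simplified rank-one update showing the core algebra; the published ROME method additionally weights the key by an activation covariance, so treat this as a sketch of the idea rather than the method itself:

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Minimal rank-one edit: return W' such that W' @ k == v_new for key
    vector k, leaving W unchanged on directions orthogonal to k."""
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)

# Check: W' @ k = W @ k + residual * (k @ k) / (k @ k) = v_new.
```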

See machine-unlearning.

Deployment-Time Mitigations

Runtime Monitoring

Three-axis monitoring:

  • Behavioral — outputs suggesting misalignment
  • Internal state — interpretability tools
  • Environmental — unexpected effects indicating hidden goals
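
A minimal sketch combining the three signal types into one runtime verdict; the scores, threshold, and decision rule are all illustrative:

```python
from dataclasses import dataclass

@dataclass
class MonitorVerdict:
    behavioral: float     # misalignment score from output analysis
    internal: float       # score from interpretability probes on internals
    environmental: float  # score from unexpected side effects

    def flagged(self, threshold=0.5):
        # Illustrative rule: alarm on any single strong signal, or on
        # moderate agreement across all three axes.
        scores = (self.behavioral, self.internal, self.environmental)
        return max(scores) > threshold or min(scores) > threshold / 2
```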

Filters and Guardrails

Constitutional AI filters evaluate outputs against explicit alignment criteria. Multi-model verification systems use diverse models to cross-check one another’s outputs.
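
A minimal sketch of multi-model cross-checking, assuming each verifier is a callable that returns True when an output passes its alignment criteria (hypothetical interface):

```python
def cross_check(output, verifiers, quorum=None):
    """Pass an output only if enough independent verifier models approve it.
    Defaults to unanimity, the conservative choice for safety filters."""
    votes = [bool(v(output)) for v in verifiers]
    required = len(votes) if quorum is None else quorum
    return sum(votes) >= required
```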

Sandboxing and Human-in-the-Loop

Capability restrictions limit what a deployed system can affect; human approval requirements gate high-stakes decisions.

Circuit Breakers

Halt execution when goal misgeneralization is detected, balancing sensitivity (catching real threats) against the false-alarm rate. See circuit-breakers.
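
A minimal sketch of a threshold-based breaker; the sensitivity/false-alarm trade-off lives entirely in the threshold and patience parameters (illustrative):

```python
class CircuitBreaker:
    """Halt execution when a misgeneralization detector fires persistently.
    A lower threshold catches more real threats but raises false alarms."""
    def __init__(self, detector, threshold=0.8, patience=3):
        self.detector = detector  # callable: state -> risk score in [0, 1]
        self.threshold = threshold
        self.patience = patience  # consecutive hits required to trip
        self.strikes = 0

    def check(self, state):
        self.strikes = self.strikes + 1 if self.detector(state) > self.threshold else 0
        if self.strikes >= self.patience:
            raise RuntimeError("circuit breaker tripped: halting execution")
```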

Learning Theory Foundations

The theoretical reason goal misgeneralization is mathematically inevitable:

Classical learning theory assumes simpler models generalize better, using parameter count as the complexity measure. This assumption breaks down for overparametrized neural networks.

Singular Learning Theory reveals that gradient descent discovers algorithms reflecting which solutions were already probable under random parameter sampling. The free energy formula makes this precise: simpler algorithms, those with fewer “effective parameters”, systematically emerge during finite training.
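
In its standard asymptotic form from Watanabe’s theory (with L_n the empirical loss, φ the prior, and w* the dominant solution), the free energy is:

```latex
F_n \;=\; -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw
\;\approx\; n\,L_n(w^{*}) \;+\; \lambda \log n
```

Since the learning coefficient λ acts as an effective parameter count, parameter regions implementing simpler algorithms have lower free energy and therefore dominate the posterior; this is the formal sense in which simple proxy goals are preferred.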

→ Goal misgeneralization is mathematically inevitable rather than merely possible:

  • Simple proxy goals occupy larger parameter space regions
  • Discovery is more probable
  • Implementing genuine alignment requires high algorithmic complexity
  • Simple aligned-looking heuristics have lower complexity → more probable

Goal Crystallization

Phase transitions where networks abandon simple goal approximations for more complex ones. Monitoring phase transitions via the Local Learning Coefficient could provide early warning when systems shift toward potentially misaligned goals.
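
One estimator of the Local Learning Coefficient from the SLT literature samples a tempered posterior localized at the trained weights w* (a hedged sketch; β is an inverse temperature and the expectation is typically estimated with SGLD):

```latex
\hat{\lambda}(w^{*}) \;=\; n\beta\,\Big(\mathbb{E}_{w\,\sim\,p(w\mid w^{*},\beta)}\big[L_n(w)\big] \;-\; L_n(w^{*})\Big)
```

Sudden jumps in λ̂ across training checkpoints would mark the phase transitions described above.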

The Combined-Approaches Imperative

“No single intervention provides complete protection. Combining data augmentation with latent adversarial training addresses goal misgeneralization at multiple levels — environmental shortcuts and internal representational shortcuts simultaneously.”

This is defense-in-depth applied to inner alignment specifically.

Connection to Wiki

This subchapter is the technical applied-mitigations layer connecting Ch.7’s theoretical framing to operational AI safety work. Connections: