AI Safety Atlas Ch.7 — Mitigations
Mitigation strategies for goal-misgeneralization, organized by lifecycle stage (training-time, post-training, deployment-time).
A Critical Constraint
Maintain strict separation between interpretability signals used for safety verification and those incorporated into training objectives. If models learn to evade detection methods used during training, those same methods become less effective for subsequent verification.
This is “the most forbidden technique” from Ch.3.4 (cot-monitoring-technique) generalized to all interpretability — use the method for analysis, but train only on outputs.
Training-Time Interventions
Mechanistic Training Stories
Don’t just specify behaviors — articulate which internal algorithms should emerge and why the training setup would produce them. Two components:
- Training goal — desired algorithmic approach
- Training rationale — why the setup produces this algorithm over proxies
Curriculum Learning
Shape the loss landscape by controlling complexity progression. Causally aligned curriculum learning maintains invariant optimal decision rules across stages, preventing proxy goals from persisting as complexity increases.
Interpretability-guided curriculum design monitors emerging representations in real-time, adjusting subsequent stages based on detected patterns.
Data Augmentation
Systematically vary potentially spurious features while keeping goal-relevant features constant. Synthetic data and procedural environment generation extend correlation-breaking. Fundamental challenge: anticipating which correlations to break before knowing which ones the model will exploit.
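A minimal sketch of the correlation-breaking idea, with hypothetical feature names (`target_pos` as the goal-relevant feature, `target_color` as the spurious one) — not the source's actual setup:

```python
import random

# Each training episode pairs a goal-relevant feature (target position) with a
# potentially spurious one (target color). Augmentation re-samples the spurious
# feature so the learner cannot use it as a proxy for the goal.

COLORS = ["red", "green", "blue", "yellow"]

def augment(episodes, copies_per_episode=3, rng=random.Random(0)):
    """Return original plus augmented episodes with spurious features re-sampled."""
    augmented = []
    for ep in episodes:
        augmented.append(ep)
        for _ in range(copies_per_episode):
            variant = dict(ep)
            variant["target_color"] = rng.choice(COLORS)  # vary spurious feature
            # goal-relevant feature ("target_pos") is kept constant
            augmented.append(variant)
    return augmented

episodes = [{"target_pos": (3, 5), "target_color": "red"}]
data = augment(episodes)  # 1 original + 3 color-randomized copies
```

The same pattern extends to procedural environment generation: hold the reward-defining feature fixed while randomizing everything that merely correlates with it.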
Latent Adversarial Training (LAT)
Tests robustness by disrupting internal representations rather than external inputs. Directly targets computational mechanisms enabling goal misgeneralization. Targeted LAT improves refusal robustness with reduced compute cost.
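A toy scalar illustration of the LAT idea, under the simplifying assumption of a two-weight linear model with manually derived gradients — a sketch of the mechanism, not a real implementation:

```python
# Latent adversarial training, scalar toy version: instead of perturbing the
# input x, perturb the hidden activation h in the direction that increases the
# loss, then take a gradient step on the adversarially perturbed forward pass.

def lat_step(w1, w2, x, y, eps=0.1, lr=0.01):
    h = w1 * x                                   # hidden activation
    out = w2 * h
    g_h = 2.0 * (out - y) * w2                   # dL/dh for squared loss
    h_adv = h + eps * (1.0 if g_h >= 0 else -1.0)  # sign-of-gradient latent attack
    # train against the perturbed latent rather than the clean one
    out_adv = w2 * h_adv
    g_w2 = 2.0 * (out_adv - y) * h_adv
    g_w1 = 2.0 * (out_adv - y) * w2 * x
    return w1 - lr * g_w1, w2 - lr * g_w2

w1, w2 = 0.5, 0.5
for _ in range(200):
    w1, w2 = lat_step(w1, w2, x=1.0, y=1.0)      # fit y=1 under latent attack
```

Because the perturbation lives in activation space rather than input space, it stresses the internal computation directly — the property that makes LAT relevant to goal misgeneralization.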
Myopic Training
Constrain systems to optimize only immediate objectives — preventing development of beyond-episode goals that enable strategic deception. Systems can’t engage in alignment faking when planning horizon doesn’t extend beyond individual responses.
This is a structural counter to scheming’s fourth prerequisite (beyond-episode goals).
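The structural effect is easiest to see in a discounted-return objective: a discount factor of zero keeps only the immediate reward, so nothing beyond the current step can contribute to the objective (illustrative sketch, names not from the source):

```python
def episode_return(rewards, gamma):
    """Discounted return; gamma=0 yields a purely myopic objective."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# With gamma=0, large future rewards are invisible to the optimizer,
# so strategies that sacrifice now to gain later cannot be reinforced.
myopic = episode_return([1.0, 5.0, 5.0], gamma=0.0)      # 1.0
farsighted = episode_return([1.0, 5.0, 5.0], gamma=1.0)  # 11.0
```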
Process-Oriented Training
Optimize reasoning steps rather than just final outputs — forcing systems to use appropriate causal reasoning instead of shortcuts. Includes:
- AI-assisted training where separate models analyze behaviors
- Relaxed adversarial training that makes transparency instrumentally valuable
Interpretability-Guided Training
Real-time analysis of model internals using SAEs, steering vectors, probes. When problematic patterns emerge, training can introduce corrective measures.
Concept Ablation Fine-Tuning (CAFT): identifies problematic activation directions and trains while projecting them out. CAFT reduced harmful responses by 10× in experiments while maintaining task performance. Works best when misaligned goals correspond to identifiable directions.
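The core projection step in CAFT can be sketched in a few lines, assuming a unit-norm direction believed to encode the problematic concept (vector values here are illustrative):

```python
# Ablate the component of an activation along a problematic concept direction.
# In CAFT this projection is applied during fine-tuning so gradients cannot
# flow through the ablated direction.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(activation, direction):
    """Remove the component of `activation` along unit vector `direction`."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

d = [1.0, 0.0, 0.0]            # assumed problematic direction (unit norm)
h = [2.0, 3.0, -1.0]           # hidden activation
h_ablated = project_out(h, d)  # component along d removed
assert abs(dot(h_ablated, d)) < 1e-9
```

This also makes the stated limitation concrete: the method only helps when the misaligned goal actually corresponds to such an identifiable direction.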
Post-Training Interventions
Steering Vectors
Linear directions in activation space corresponding to specific traits or objectives. Enable both detection (measuring projection) and control (steering along directions).
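Both uses can be sketched with plain vector arithmetic, assuming a precomputed trait direction `v` (e.g. a difference of mean activations between contrastive prompts — an assumption, not the source's recipe):

```python
# Detection reads the projection onto the trait direction; control shifts the
# activation along that direction by a chosen strength alpha.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def detect(activation, v):
    """Scalar projection of the activation onto the trait direction."""
    return dot(activation, v) / dot(v, v) ** 0.5

def steer(activation, v, alpha):
    """Shift the activation along the trait direction by strength alpha."""
    return [a + alpha * x for a, x in zip(activation, v)]

v = [0.0, 1.0]                        # hypothetical trait direction
h = [0.5, 2.0]                        # hidden activation
score = detect(h, v)                  # large projection = trait present
h_steered = steer(h, v, alpha=-2.0)   # negative alpha suppresses the trait
```

In a real model the same arithmetic is applied to residual-stream activations at a chosen layer, typically via forward hooks.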
Model Editing and Unlearning
Surgical removal of goal-related representations without full retraining. Rank-One Model Editing, concept erasure. Limitation: cannot prevent models from redeveloping similar patterns.
See machine-unlearning.
Deployment-Time Mitigations
Runtime Monitoring
Three-axis monitoring:
- Behavioral — outputs suggesting misalignment
- Internal state — interpretability tools
- Environmental — unexpected effects indicating hidden goals
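The three axes can be combined in a simple flagging loop; all names and thresholds below are illustrative assumptions, not from the source:

```python
# Each checker is assumed to emit a risk score in [0, 1]; the monitor reports
# which axes exceed their thresholds so downstream guardrails can respond.

def monitor(output_risk, internal_risk, env_risk, thresholds=(0.8, 0.8, 0.8)):
    axes = {
        "behavioral": output_risk,      # outputs suggesting misalignment
        "internal": internal_risk,      # interpretability-tool readings
        "environmental": env_risk,      # unexpected side effects
    }
    return [name for (name, score), t in zip(axes.items(), thresholds) if score > t]

flags = monitor(0.2, 0.9, 0.1)  # only the internal-state axis trips
```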
Filters and Guardrails
Constitutional AI filters evaluate outputs against alignment criteria. Multi-model verification systems use diverse AI to cross-check.
Sandboxing and Human-in-the-Loop
Capability restrictions; approval requirements for high-stakes decisions.
Circuit Breakers
Halt execution when goal misgeneralization is detected. Balance sensitivity (catch real threats) vs. false-alarm rates. See circuit-breakers.
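One common way to trade sensitivity against false alarms is to require several consecutive exceedances before tripping; the sketch below is illustrative, with made-up parameters:

```python
# Toy circuit breaker: trips only after `patience` consecutive steps above the
# threshold, so a single noisy spike does not halt execution but a sustained
# signal does.

class CircuitBreaker:
    def __init__(self, threshold=0.9, patience=3):
        self.threshold = threshold
        self.patience = patience   # consecutive exceedances required to trip
        self.streak = 0
        self.tripped = False

    def observe(self, score):
        self.streak = self.streak + 1 if score > self.threshold else 0
        if self.streak >= self.patience:
            self.tripped = True    # caller halts execution when this is set
        return self.tripped

cb = CircuitBreaker()
for s in [0.95, 0.2, 0.95, 0.95, 0.95]:  # one spike, then a sustained run
    cb.observe(s)                         # trips on the third consecutive hit
```

Raising `patience` lowers the false-alarm rate at the cost of slower detection — exactly the sensitivity balance noted above.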
Learning Theory Foundations
The theoretical reason goal misgeneralization is mathematically inevitable:
Classical learning theory assumes simpler models generalize better, using parameter count as complexity measure. This breaks down for overparametrized neural networks.
Singular Learning Theory reveals: gradient descent discovers algorithms reflecting which solutions were already probable under random sampling. The Free Energy Formula mathematically formalizes: simpler algorithms with fewer “effective parameters” systematically emerge during finite training.
→ Goal misgeneralization is mathematically inevitable rather than merely possible:
- Simple proxy goals occupy larger regions of parameter space, so gradient descent discovers them with higher probability
- Implementing genuine alignment requires high algorithmic complexity
- Simple aligned-looking heuristics have lower complexity and are therefore more probable
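For reference, the Free Energy Formula can be written in its standard SLT asymptotic form (this rendering is supplied here, not quoted from the source):

```latex
F_n \;\approx\; n L_n(w^*) \;+\; \lambda \log n
```

where $F_n$ is the free energy of a region of parameter space, $L_n(w^*)$ the empirical loss at its most likely parameter, and $\lambda$ the local learning coefficient (the count of "effective parameters"). At comparable loss, lower $\lambda$ means lower free energy and hence higher posterior probability — the formal sense in which simpler algorithms systematically win.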
Goal Crystallization
Phase transitions where networks abandon simple goal approximations for more complex ones. Monitoring phase transitions via the Local Learning Coefficient could provide early warning when systems shift toward potentially misaligned goals.
The Combined-Approaches Imperative
“No single intervention provides complete protection. Combining data augmentation with latent adversarial training addresses goal misgeneralization at multiple levels — environmental shortcuts and internal representational shortcuts simultaneously.”
This is defense-in-depth applied to inner alignment specifically.
Connection to Wiki
This subchapter is the technical applied-mitigations layer connecting Ch.7’s theoretical framing to operational AI safety work. Connections:
- goal-misgeneralization — substantially expanded
- defense-in-depth — applied to inner alignment
- machine-unlearning — post-training intervention
- cot-monitoring-technique — process supervision
- circuit-breakers — deployment-time mitigation
- interpretability — load-bearing
- ai-control — sandboxing and human-in-the-loop
- learning-dynamics — Singular Learning Theory connection