AI Safety Atlas Ch.7 — Mitigations

Mitigation strategies for goal misgeneralization, organized by lifecycle stage: training-time, post-training, and deployment-time.

A Critical Constraint

Maintain strict separation between interpretability signals used for safety verification and those incorporated into training objectives. If models learn to evade detection methods used during training, those same methods become less effective for subsequent verification.

This is “the most forbidden technique” from Ch.3.4 (cot-monitoring-technique), generalized to all interpretability: use the method to analyze the model, but train only on its outputs.

Training-Time Interventions

Mechanistic Training Stories

Don’t just specify behaviors — articulate which internal algorithms should emerge and why the training setup would produce them. Two components:

  1. Training goal — desired algorithmic approach
  2. Training rationale — why the setup produces this algorithm over proxies

Curriculum Learning

Shape the loss landscape by controlling how task complexity progresses. Causally aligned curriculum learning keeps the optimal decision rule invariant across stages, so proxy goals that succeed early do not remain optimal as complexity increases.

Interpretability-guided curriculum design monitors emerging representations in real time and adjusts subsequent stages based on detected patterns, as sketched below.
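
A minimal sketch of that loop, assuming a hypothetical `probe_proxy_score` that measures how strongly a known spurious feature is linearly decodable from the model's activations (all names here are illustrative, not from the source):

```python
def run_curriculum(model, stages, train_on, probe_proxy_score, threshold=0.2):
    """Advance through complexity-ordered stages, pausing to break proxy
    correlations whenever a probe suggests a spurious feature is forming.

    stages            -- datasets ordered from simple to complex
    train_on          -- callable(model, data): one stage of training
    probe_proxy_score -- callable(model, data) -> how linearly decodable a
                         known spurious feature is (0 = absent); hypothetical
    """
    for stage_data in stages:
        train_on(model, stage_data)
        if probe_proxy_score(model, stage_data) > threshold:
            # Representation drifted toward the proxy: retrain this stage on
            # correlation-broken data before increasing task complexity.
            train_on(model, stage_data.break_proxy_correlation())
    return model
```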

Data Augmentation

Systematically vary potentially spurious features while holding goal-relevant features constant. Synthetic data and procedural environment generation extend correlation-breaking to scale. The fundamental challenge: anticipating which correlations to break without advance knowledge of the deployment distribution.
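
A minimal sketch of correlation-breaking on image-like data, assuming the spurious feature is a background tint (the background heuristic and names are illustrative, not from the source):

```python
import torch

def break_spurious_correlation(images, labels, num_variants=4):
    """Replicate each example under randomized background tints so the
    spurious feature (background color) no longer predicts the label,
    while the goal-relevant foreground is left untouched.

    images -- tensor of shape (N, 3, H, W); labels -- list of ints
    """
    augmented, aug_labels = [], []
    for img, y in zip(images, labels):
        # Treat near-black pixels as "background" (illustrative heuristic).
        mask = (img.sum(dim=0, keepdim=True) < 0.1).float()
        for _ in range(num_variants):
            tint = torch.rand(3, 1, 1)  # random per-channel background color
            augmented.append(img * (1 - mask) + tint * mask)
            aug_labels.append(y)
    return torch.stack(augmented), torch.tensor(aug_labels)
```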

Latent Adversarial Training (LAT)

Tests robustness by perturbing internal representations rather than external inputs, directly targeting the computational mechanisms that enable goal misgeneralization. Targeted LAT improves refusal robustness at reduced compute cost.
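
A minimal single-step sketch of untargeted LAT, assuming the model splits into an encoder and a head so the latent can be perturbed directly (the split and the perturbation rule are illustrative):

```python
import torch

def lat_step(encoder, head, loss_fn, x, y, eps=0.1, opt=None):
    """One latent adversarial training step: find a small perturbation of the
    internal representation that maximizes loss, then train the model to
    behave correctly even under that perturbed representation."""
    # Inner step: ascend the loss with respect to a latent perturbation.
    z = encoder(x)
    delta = torch.zeros_like(z, requires_grad=True)
    adv_loss = loss_fn(head(z + delta), y)
    grad, = torch.autograd.grad(adv_loss, delta)
    delta = eps * grad / (grad.norm() + 1e-8)  # fixed-size worst-case step

    # Outer step: descend the task loss under the perturbed latent.
    loss = loss_fn(head(encoder(x) + delta), y)
    if opt is not None:
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```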

Myopic Training

Constrain systems to optimize only immediate objectives, preventing the development of beyond-episode goals that enable strategic deception. Systems can’t engage in alignment faking when their planning horizon doesn’t extend beyond individual responses.

This is a structural counter to scheming’s fourth prerequisite (beyond-episode goals).
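
In reinforcement-learning terms, myopia is the γ = 0 special case of the discounted objective (a standard formulation, stated here for concreteness):

```latex
J_\gamma(\pi) \;=\; \mathbb{E}_\pi\Big[\textstyle\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k}\Big],
\qquad
J_{\text{myopic}}(\pi) \;=\; J_{0}(\pi) \;=\; \mathbb{E}_\pi\big[r_t\big].
```

With γ = 0, rewards beyond the current step never enter the objective, so gradients cannot select for plans that only pay off later.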

Process-Oriented Training

Optimize reasoning steps rather than just final outputs, forcing systems to use appropriate causal reasoning instead of shortcuts (a step-scoring sketch follows the list). Includes:

  • AI-assisted training where separate models analyze behaviors
  • Relaxed adversarial training that makes transparency instrumentally valuable
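
A minimal sketch of step-level scoring, assuming a hypothetical `step_scorer` model that rates individual reasoning steps; only the aggregation is shown:

```python
def process_reward(steps, step_scorer):
    """Score a chain of reasoning step by step rather than only the final
    answer, so flawed intermediate reasoning is penalized even when the
    final output happens to be correct."""
    if not steps:
        return 0.0
    return sum(step_scorer(s) for s in steps) / len(steps)
```

A trainer would then maximize `process_reward(trajectory.steps, scorer)` instead of an outcome-only reward on the final answer (names illustrative).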

Interpretability-Guided Training

Real-time analysis of model internals using sparse autoencoders (SAEs), steering vectors, and probes. When problematic patterns emerge, training can introduce corrective measures.

Concept Ablation Fine-Tuning (CAFT) identifies problematic activation directions and trains while projecting them out. In experiments, CAFT reduced harmful responses by 10× while maintaining task performance. It works best when misaligned goals correspond to identifiable linear directions.
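
A minimal sketch of the projection at the heart of concept ablation, assuming the problematic direction has already been identified (e.g., via SAEs or contrastive activations); this illustrates the idea, not the paper's implementation:

```python
import torch

def ablate_direction(acts, direction):
    """Project activations onto the orthogonal complement of a unit-norm
    'problematic' direction, removing that concept's contribution."""
    d = direction / direction.norm()
    return acts - (acts @ d).unsqueeze(-1) * d

class AblationHook:
    """Forward hook that ablates a concept direction during fine-tuning, so
    gradients never reinforce computation along that direction."""
    def __init__(self, direction):
        self.direction = direction
    def __call__(self, module, inputs, output):
        return ablate_direction(output, self.direction)

# Usage (illustrative): register on a layer whose forward returns a tensor,
# fine-tune as usual, then remove the hook before evaluation:
#   handle = model.layers[k].register_forward_hook(AblationHook(d))
#   ...train...
#   handle.remove()
```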

Post-Training Interventions

Steering Vectors

Linear directions in activation space corresponding to specific traits or objectives. They enable both detection (measuring an activation’s projection onto the direction) and control (adding the scaled direction during inference).
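
A minimal sketch of the standard difference-of-means construction and both uses, assuming cached residual-stream activations:

```python
import torch

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations over contrastive prompt sets
    (e.g., honest vs. deceptive completions); shape (d_model,)."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def detect(act, v):
    """Detection: projection of an activation onto the unit direction."""
    return torch.dot(act, v) / v.norm()

def steer(act, v, alpha=1.0):
    """Control: add the scaled direction to an activation at inference."""
    return act + alpha * v
```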

Model Editing and Unlearning

Surgical removal of goal-related representations without full retraining; techniques include Rank-One Model Editing and concept erasure. Limitation: editing cannot prevent models from redeveloping similar patterns.
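
A simplified rank-one update showing the core algebra; the published ROME method additionally weights the key by an activation covariance, so treat this as a sketch of the idea rather than the method itself:

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Minimal rank-one edit: return W' such that W' @ k == v_new for key
    vector k, leaving W unchanged on directions orthogonal to k."""
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)

# Check: W' @ k = W @ k + residual * (k @ k) / (k @ k) = v_new.
```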

See machine-unlearning.

Deployment-Time Mitigations

Runtime Monitoring

Three-axis monitoring:

  • Behavioral — outputs suggesting misalignment
  • Internal state — interpretability tools
  • Environmental — unexpected effects indicating hidden goals
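
A minimal sketch combining the three signal types into one runtime verdict; the scores, threshold, and decision rule are all illustrative:

```python
from dataclasses import dataclass

@dataclass
class MonitorVerdict:
    behavioral: float     # misalignment score from output analysis
    internal: float       # score from interpretability probes on internals
    environmental: float  # score from unexpected side effects

    def flagged(self, threshold=0.5):
        # Illustrative rule: alarm on any single strong signal, or on
        # moderate agreement across all three axes.
        scores = (self.behavioral, self.internal, self.environmental)
        return max(scores) > threshold or min(scores) > threshold / 2
```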

Filters and Guardrails

Constitutional AI filters evaluate outputs against explicit alignment criteria. Multi-model verification systems use diverse models to cross-check one another’s outputs.
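
A minimal sketch of multi-model cross-checking, assuming each verifier is a callable that returns True when an output passes its alignment criteria (hypothetical interface):

```python
def cross_check(output, verifiers, quorum=None):
    """Pass an output only if enough independent verifier models approve it.
    Defaults to unanimity, the conservative choice for safety filters."""
    votes = [bool(v(output)) for v in verifiers]
    required = len(votes) if quorum is None else quorum
    return sum(votes) >= required
```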

Sandboxing and Human-in-the-Loop

Capability restrictions limit what a deployed system can affect; human approval requirements gate high-stakes decisions.

Circuit Breakers

Halt execution when goal misgeneralization is detected, balancing sensitivity (catching real threats) against the false-alarm rate. See circuit-breakers.
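
A minimal sketch of a threshold-based breaker; the sensitivity/false-alarm trade-off lives entirely in the threshold and patience parameters (illustrative):

```python
class CircuitBreaker:
    """Halt execution when a misgeneralization detector fires persistently.
    A lower threshold catches more real threats but raises false alarms."""
    def __init__(self, detector, threshold=0.8, patience=3):
        self.detector = detector  # callable: state -> risk score in [0, 1]
        self.threshold = threshold
        self.patience = patience  # consecutive hits required to trip
        self.strikes = 0

    def check(self, state):
        self.strikes = self.strikes + 1 if self.detector(state) > self.threshold else 0
        if self.strikes >= self.patience:
            raise RuntimeError("circuit breaker tripped: halting execution")
```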

Learning Theory Foundations

The theoretical reason goal misgeneralization is mathematically inevitable:

Classical learning theory assumes simpler models generalize better, using parameter count as the complexity measure. This assumption breaks down for overparametrized neural networks.

Singular Learning Theory reveals that gradient descent discovers algorithms reflecting which solutions were already probable under random parameter sampling. The free energy formula makes this precise: simpler algorithms, those with fewer “effective parameters”, systematically emerge during finite training.
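
In its standard asymptotic form from Watanabe’s theory (with L_n the empirical loss, φ the prior, and w* the dominant solution), the free energy is:

```latex
F_n \;=\; -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw
\;\approx\; n\,L_n(w^{*}) \;+\; \lambda \log n
```

Since the learning coefficient λ acts as an effective parameter count, parameter regions implementing simpler algorithms have lower free energy and therefore dominate the posterior; this is the formal sense in which simple proxy goals are preferred.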

→ Goal misgeneralization is mathematically inevitable rather than merely possible:

  • Simple proxy goals occupy larger parameter space regions
  • Discovery is more probable
  • Implementing genuine alignment requires high algorithmic complexity
  • Simple aligned-looking heuristics have lower complexity → more probable

Goal Crystallization

Phase transitions where networks abandon simple goal approximations for more complex ones. Monitoring phase transitions via the Local Learning Coefficient could provide early warning when systems shift toward potentially misaligned goals.
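
One estimator of the Local Learning Coefficient from the SLT literature samples a tempered posterior localized at the trained weights w* (a hedged sketch; β is an inverse temperature and the expectation is typically estimated with SGLD):

```latex
\hat{\lambda}(w^{*}) \;=\; n\beta\,\Big(\mathbb{E}_{w\,\sim\,p(w\mid w^{*},\beta)}\big[L_n(w)\big] \;-\; L_n(w^{*})\Big)
```

Sudden jumps in λ̂ across training checkpoints would mark the phase transitions described above.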

The Combined-Approaches Imperative

“No single intervention provides complete protection. Combining data augmentation with latent adversarial training addresses goal misgeneralization at multiple levels — environmental shortcuts and internal representational shortcuts simultaneously.”

This is defense-in-depth applied to inner alignment specifically.

Connection to Wiki

This subchapter is the technical applied-mitigations layer connecting Ch.7’s theoretical framing to operational AI safety work. Connections: