AI Safety Atlas Ch.7 — Detection
Source: Detection
How to detect goal misgeneralization — and why it’s fundamentally difficult.
The Core Challenge
“Goal misgeneralization often looks like success until the distribution shifts enough to reveal the proxy.”
Unlike capability failures (obvious performance drops), misaligned objectives can remain hidden during training while producing acceptable outputs. The problem spans a spectrum from simple proxy learning to sophisticated deceptive behavior — each requires different detection approaches.
The Atlas advocates defense in depth — multiple complementary methods providing uncorrelated signals.
Behavioral Techniques (Black-Box)
Externalized Reasoning Monitoring
Models using chain-of-thought externalize decision-making in human-readable text → potential to reveal goal misgeneralization. See cot-monitoring-technique.
Faithfulness gap is the key concern:
- Claude 3.7 Sonnet mentioned hints only 25% of the time when hints influenced answers
- DeepSeek R1 acknowledged hints 39% of the time
- When trained with deliberate reward hacks, models exploited the hacks in 99% of cases but verbalized the hacking in less than 2% of examples
CoT provides partial visibility, not standalone safety guarantee.
Internal Techniques (White-Box)
Linear Probes
Simple classifiers on model activations to detect reasoning patterns distinguishing aligned from misaligned behavior.
Strong empirical results: when calibrated for 1% false positive rate, probes caught 95-99% of deceptive responses in realistic scenarios (insider trading concealment, capability sandbagging).
Limitations: probes often track spurious correlations rather than target behaviors — detecting “is this scenario related to deception” rather than “is this specific response deceptive.”
Sparse Autoencoders (SAEs)
Decompose activations into interpretable components. Successfully identified “hidden reward model sycophancy objectives” where semantic search failed.
Limitation: SAEs struggle with naturally emerging goal misgeneralization that doesn’t manifest as explicit, interpretable features.
Activation Manipulation (“Intoxication”)
Adding calibrated noise to activations during inference causes misaligned models to reveal true tendencies more readily than aligned models.
Result: 98% accuracy distinguishing alignment-faking from genuinely aligned models. Implies misaligned objectives involve simpler computational pathways vulnerable when normal processing is disrupted.
Reasoning Structure Analysis
Identifies “thought anchors” — critical reasoning steps disproportionately influencing outcome. Counterfactual resampling and attention analysis reveal when planning steps indicate problematic goals.
Combined Approaches
Comprehensive detection requires integrating multiple techniques:
- SAE feature identification
- Activation patching validation
- Probe generalization testing
- Behavioral anomaly monitoring
When independent techniques converge on the same concerning pattern, evidence strengthens substantially.
The Atlas’s honest caveat: “no such comprehensive test suite exists at the time of writing” — detection research remains nascent.
Connection to Wiki
This subchapter operationalizes the deception and scheming SR2025 evaluation agendas:
- capability-evaluations, evaluation-techniques — broader evaluation context
- interpretability — provides the white-box techniques
- cot-monitoring-technique — externalized-reasoning monitoring
- sparse-coding — SAE-based detection
- lie-and-deception-detectors — SR2025 agenda
- monitoring-concepts — SR2025 agenda
Related Pages
- ai-safety-atlas-textbook
- goal-misgeneralization
- scheming
- evaluation-techniques
- interpretability
- cot-monitoring-technique
- sparse-coding
- lie-and-deception-detectors
- monitoring-concepts
- ai-deception-evals
- ai-scheming-evals
- atlas-ch7-goal-misgeneralization-04-scheming
- atlas-ch7-goal-misgeneralization-06-mitigations