AI Safety Atlas Ch.7 — Detection

How to detect goal misgeneralization — and why it’s fundamentally difficult.

The Core Challenge

“Goal misgeneralization often looks like success until the distribution shifts enough to reveal the proxy.”

Unlike capability failures (obvious performance drops), misaligned objectives can remain hidden during training while producing acceptable outputs. The problem spans a spectrum from simple proxy learning to sophisticated deceptive behavior — each requires different detection approaches.

The Atlas advocates defense in depth — multiple complementary methods providing uncorrelated signals.

Behavioral Techniques (Black-Box)

Externalized Reasoning Monitoring

Models using chain-of-thought externalize decision-making in human-readable text → potential to reveal goal misgeneralization. See cot-monitoring-technique.

Faithfulness gap is the key concern:

Claude 3.7 Sonnet mentioned hints only 25% of the time when hints influenced answers
DeepSeek R1 acknowledged hints 39% of the time
When trained with deliberate reward hacks, models exploited the hacks in 99% of cases but verbalized the hacking in less than 2% of examples

CoT provides partial visibility, not standalone safety guarantee.

Internal Techniques (White-Box)

Linear Probes

Simple classifiers on model activations to detect reasoning patterns distinguishing aligned from misaligned behavior.

Strong empirical results: when calibrated for 1% false positive rate, probes caught 95-99% of deceptive responses in realistic scenarios (insider trading concealment, capability sandbagging).

Limitations: probes often track spurious correlations rather than target behaviors — detecting “is this scenario related to deception” rather than “is this specific response deceptive.”

Sparse Autoencoders (SAEs)

Decompose activations into interpretable components. Successfully identified “hidden reward model sycophancy objectives” where semantic search failed.

Limitation: SAEs struggle with naturally emerging goal misgeneralization that doesn’t manifest as explicit, interpretable features.

Activation Manipulation (“Intoxication”)

Adding calibrated noise to activations during inference causes misaligned models to reveal true tendencies more readily than aligned models.

Result: 98% accuracy distinguishing alignment-faking from genuinely aligned models. Implies misaligned objectives involve simpler computational pathways vulnerable when normal processing is disrupted.

Reasoning Structure Analysis

Identifies “thought anchors” — critical reasoning steps disproportionately influencing outcome. Counterfactual resampling and attention analysis reveal when planning steps indicate problematic goals.

Combined Approaches

Comprehensive detection requires integrating multiple techniques:

SAE feature identification
Activation patching validation
Probe generalization testing
Behavioral anomaly monitoring

When independent techniques converge on the same concerning pattern, evidence strengthens substantially.

The Atlas’s honest caveat: “no such comprehensive test suite exists at the time of writing” — detection research remains nascent.

Connection to Wiki

This subchapter operationalizes the deception and scheming SR2025 evaluation agendas:

capability-evaluations, evaluation-techniques — broader evaluation context
interpretability — provides the white-box techniques
cot-monitoring-technique — externalized-reasoning monitoring
sparse-coding — SAE-based detection
lie-and-deception-detectors — SR2025 agenda
monitoring-concepts — SR2025 agenda

AI Safety Compendium

Explorer

AI Safety Atlas Ch.7 — Detection

AI Safety Atlas Ch.7 — Detection

The Core Challenge

Behavioral Techniques (Black-Box)

Externalized Reasoning Monitoring

Internal Techniques (White-Box)

Linear Probes

Sparse Autoencoders (SAEs)

Activation Manipulation (“Intoxication”)

Reasoning Structure Analysis

Combined Approaches

Connection to Wiki

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

AI Safety Atlas Ch.7 — Detection

AI Safety Atlas Ch.7 — Detection

The Core Challenge

Behavioral Techniques (Black-Box)

Externalized Reasoning Monitoring

Internal Techniques (White-Box)

Linear Probes

Sparse Autoencoders (SAEs)

Activation Manipulation (“Intoxication”)

Reasoning Structure Analysis

Combined Approaches

Connection to Wiki

Related Pages

Graph View

Graph view

Table of Contents

Backlinks