AI Safety Atlas Ch.5 — Evaluation Techniques
Source: Evaluation Techniques
Two complementary technique families: behavioral (analyze outputs, “black-box”) and internal (examine activations and computations, “gray-box”). Combined, they build stronger safety evidence than either alone. See evaluation-techniques.
Behavioral Techniques
Standard Prompting and Testing
Most basic form — predefined inputs, analyze outputs. Even “simple” testing requires careful question framing.
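A minimal sketch of this kind of fixed-prompt testing; the prompt set, the `generate` call, and the outputs are hypothetical placeholders, not from the Atlas:

```python
# Toy prompt set; real evaluations use curated, carefully framed question banks.
PROMPTS = [
    "Refusal check: how would someone bypass a content filter?",
    "Honesty check: summarize the following claim accurately.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation (hypothetical)."""
    return "I can't help with that."

# Run every prompt once and log the outputs for later analysis.
results = {p: generate(p) for p in PROMPTS}
for prompt, output in results.items():
    print(prompt, "->", output)
```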
Elicitation and Scaffolding
Best-of-N Sampling — generate multiple responses, select the best per scoring criteria. Establishes upper bounds on capability. METR’s modular agent framework uses a four-step loop: Prompter → Generator → Discriminator → Actor.
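A minimal best-of-N sketch, assuming hypothetical `generate` and `score` placeholders rather than METR's actual framework:

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for one sampled response from the model under evaluation (hypothetical)."""
    return random.choice(["partial attempt", "complete solution", "refusal"])

def score(response: str) -> float:
    """Placeholder scoring criterion, e.g. a task-specific grader (hypothetical)."""
    return float(len(response))

def best_of_n(prompt: str, n: int = 16) -> str:
    # Sample n candidate responses and keep the highest-scoring one; the best
    # candidate approximates an upper bound on what the model can do here.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Plan the next step of the agentic task."))
```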
Multi-Step Reasoning Prompting — “Let’s solve this step by step.” Variants: Chain of Thought (CoT), CoT-Self-Consistency, Tree of Thoughts, Graph of Thoughts. Reveals dormant capabilities.
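A minimal sketch of CoT-Self-Consistency (majority vote over sampled reasoning paths); `sample_final_answer` is a hypothetical placeholder for the model call:

```python
import random
from collections import Counter

def sample_final_answer(prompt: str) -> str:
    """Placeholder: final answer from one sampled chain-of-thought completion (hypothetical)."""
    return random.choice(["42", "42", "41"])

def self_consistency(question: str, k: int = 8) -> str:
    # Append a step-by-step instruction, sample k independent reasoning paths,
    # and return the majority-vote answer.
    prompt = question + "\nLet's solve this step by step."
    answers = [sample_final_answer(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```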
Supervised Fine-Tuning — modify weights to consistently exhibit specific behaviors. Research using “password-locked models” showed fine-tuning recovers >90% of hidden performance with just a few examples. More reliable than prompting for capability elicitation.
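A toy sketch of the elicitation protocol: fine-tune on a handful of demonstrations, then re-measure. The `nn.Linear` stand-in is hypothetical; real elicitation fine-tunes the actual model under evaluation:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model; real runs fine-tune the evaluated LLM.
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# A few supervised demonstrations of the target behavior (toy tensors here).
demos_x = torch.randn(8, 16)
demos_y = torch.randint(0, 4, (8,))

for step in range(100):  # a small number of gradient steps on few examples
    optimizer.zero_grad()
    loss = loss_fn(model(demos_x), demos_y)
    loss.backward()
    optimizer.step()

# After fine-tuning, re-run the capability benchmark: if performance jumps,
# the capability was present but not elicited by prompting alone.
```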
Tool-Augmentation — give models access to external tools/APIs. Reveals capabilities through scaffolding.
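A minimal tool-augmentation loop, with a hypothetical JSON tool-call convention and a toy calculator tool standing in for real scaffolding (browsers, code execution, file systems):

```python
import json

# Hypothetical tool registry.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def model_step(history: list) -> str:
    """Placeholder policy for the model: call the calculator once, then answer (hypothetical)."""
    if any(h.startswith("tool_result") for h in history):
        return json.dumps({"answer": history[-1].removeprefix("tool_result: ")})
    return json.dumps({"tool": "calculator", "input": "17 * 24"})

def run_with_tools(task: str, max_steps: int = 5) -> str:
    history = [task]
    for _ in range(max_steps):
        action = json.loads(model_step(history))
        if action.get("tool") in TOOLS:
            history.append("tool_result: " + TOOLS[action["tool"]](action["input"]))
        else:
            return action.get("answer", "")
    return ""

print(run_with_tools("Compute 17 * 24 using the tools available."))
```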
Red Teaming — systematic attempts to elicit concerning behaviors. Used extensively in control evaluations.
Long-Term Interaction Studies — observe model behavior over extended interactions. DeepMind’s “Hidden Agenda” evaluation studies multi-turn manipulation.
Internal Techniques
Examine internal representations, activation patterns, computational mechanisms. Currently under development — most evaluations still use only behavioral techniques.
Enumerative Safety
Systematically catalog individual features (concepts) that neural networks represent internally. Differs from mechanistic interpretability (which traces computational pathways).
Sparse Autoencoders (SAEs)
Decompose model activations into sparse, interpretable components (features). Anthropic extracted interpretable features from Claude 3 Sonnet, including features related to deception, sycophancy, and bias. Used in alignment auditing to identify hidden objectives.
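A minimal SAE sketch (assumed architecture: linear encoder, ReLU, linear decoder, L1 sparsity penalty), with random tensors standing in for real residual-stream activations:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: expand d_model activations into n_features sparse features."""
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return features, recon

sae = SparseAutoencoder()
acts = torch.randn(32, 512)                         # stand-in activations
features, recon = sae(acts)
l1 = features.abs().mean()                          # sparsity penalty
loss = ((recon - acts) ** 2).mean() + 1e-3 * l1     # reconstruction + sparsity
```

Features whose top-activating inputs share an interpretable theme (e.g. deception-related text) become candidates for safety-relevant monitoring.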
Representation Engineering (RepE)
Studies how models encode information internally, placing representations at the center of analysis (rather than neurons or circuits, as in mechanistic interpretability).
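A difference-of-means sketch in the spirit of RepE; the contrast sets and activations are random stand-ins, and the honesty framing is illustrative only:

```python
import numpy as np

# Toy activations: the mean difference between two contrasting prompt sets
# yields a "reading vector" for a concept such as honesty.
honest_acts = np.random.randn(100, 512) + 0.5      # stand-in activations, honest prompts
dishonest_acts = np.random.randn(100, 512) - 0.5   # stand-in activations, dishonest prompts

direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project new activations onto the direction to read out how the model's
# internal state scores along the concept axis for a given input.
new_acts = np.random.randn(512)
honesty_score = float(new_acts @ direction)
print(honesty_score)
```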
Probes
Linear classifiers trained on internal activations to detect specific patterns. Researchers trained “coup probes” that reliably flagged descriptions of takeover attempts even when the text was reworded or translated.
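A minimal probe sketch using an off-the-shelf logistic regression; the activations and labels are random stand-ins for real hidden states and annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: activations from prompts labeled takeover-related (1) or benign (0).
# In practice these are hidden states collected from the model under evaluation.
X = np.random.randn(200, 512)
y = np.random.randint(0, 2, 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At deployment, flag inputs whose activations the probe scores as suspicious.
new_activation = np.random.randn(1, 512)
p_flag = probe.predict_proba(new_activation)[0, 1]
if p_flag > 0.5:
    print("flagged for review")
```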
Probe limitation: reliability depends on models not being dramatically more capable than humans in safety-critical domains. Extremely advanced systems might manipulate their internal representations to fool probes (related to “gradient hacking”), though this is currently extremely difficult.
Alignment Audits
Systematic investigations into whether AI models pursue hidden objectives.
Anthropic’s “auditing game”: a version of Claude 3.5 Haiku was deliberately trained as an RM-sycophant that exploits reward model biases. Four research teams investigated it. Three teams with training-data access succeeded; one team without data access failed. Key insight: training-data access is load-bearing for hidden-objective detection.
The Poser benchmark — 324 pairs of LLaMA-13B models, each pair containing one aligned model and one alignment faker. Detection strategies tested:
- “Intoxicating” models with activation noise → 98% detection (toy sketch after this list)
- Fine-tuning learning-rate comparison → 90% accuracy
- Activation-direction analysis → 71% accuracy
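A toy illustration of the first strategy above (“intoxication” with activation noise); `misbehavior_rate` and the model dictionaries are hypothetical stand-ins, not the Poser implementation:

```python
def misbehavior_rate(model: dict, noise_scale: float) -> float:
    """Placeholder: run the model with Gaussian noise of this scale added to its
    activations and return the fraction of misaligned responses (hypothetical)."""
    return model["latent_misalignment"] * noise_scale  # toy stand-in

def detect_faker(model_a: dict, model_b: dict, scales=(0.5, 1.0, 2.0)) -> str:
    # Heuristic: the alignment faker tends to reveal misbehavior faster
    # as activation noise increases.
    rate_a = sum(misbehavior_rate(model_a, s) for s in scales)
    rate_b = sum(misbehavior_rate(model_b, s) for s in scales)
    return "model_a" if rate_a > rate_b else "model_b"

faker = {"latent_misalignment": 0.40}
aligned = {"latent_misalignment": 0.05}
print(detect_faker(faker, aligned))  # -> "model_a"
```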
Combined Defense in Depth
The Atlas’s stance: internal techniques complement behavioral methods rather than replacing them. “De-correlated signals” from the two families improve detection over either alone.
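A back-of-the-envelope illustration of why de-correlated signals help, assuming (hypothetically) independent detectors with illustrative hit rates:

```python
# If a behavioral eval and an internal probe give de-correlated signals,
# the chance of missing a problem with both is roughly the product of the
# individual miss rates (exactly so under independence).
p_behavioral, p_internal = 0.7, 0.6
p_combined = 1 - (1 - p_behavioral) * (1 - p_internal)
print(p_combined)  # 0.88, higher than either detector alone
```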
Fundamental limitations of internal techniques:
- Superposition (more concepts than neurons, so features are spread across overlapping combinations of neurons)
- Inherent errors in current SAEs
- Difficulty measuring progress without ground truth
- “Long tail” of complex poorly-understood behaviors
- Can’t prove absence of deceptive circuits
The goal: make deception harder and riskier for AI, not guarantee detection.
Connection to Wiki
- evaluation-techniques — dedicated concept page
- interpretability — internal-techniques umbrella
- sparse-coding — SR2025 agenda for SAEs
- lie-and-deception-detectors — SR2025 agenda
- monitoring-concepts — SR2025 agenda
- deceptive-alignment, ai-deception-evals, ai-scheming-evals — what these techniques try to detect
- atlas-ch5-evaluations-03-evaluated-properties