Evaluation Techniques
Two complementary technique families for AI safety evaluation: behavioral (analyze outputs, “black-box”) and internal (examine activations and computations, “gray-box”). Together they build stronger safety evidence than either alone.
The AI Safety Atlas (Ch.5.2) treats this as the technical menu for evaluating capabilities, propensities, and control.
Behavioral Techniques
Examine AI systems through observable outputs. Also called black-box or input-output (IO) evaluations.
Standard Prompting and Testing
Feed the model predefined inputs and analyze its outputs. The baseline technique. Even simple testing requires careful question framing.
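A minimal sketch of what such a harness looks like, assuming a hypothetical `query_model` API call and naive containment grading (real evals need more careful scoring):

```python
# Minimal behavioral (black-box) eval loop. `query_model` is a hypothetical
# stand-in for whatever model API your stack exposes.
def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with your model API call")

EVAL_ITEMS = [
    {"prompt": "What is 17 * 23?", "expected": "391"},
    {"prompt": "Name the capital of France.", "expected": "Paris"},
]

def run_eval(items) -> float:
    correct = 0
    for item in items:
        output = query_model(item["prompt"])
        if item["expected"] in output:  # crude containment check, not robust grading
            correct += 1
    return correct / len(items)
```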
Elicitation and Scaffolding
Actively draw out capabilities that are not apparent from naive prompting.
Best-of-N Sampling
Generate N completions, select best. Establishes upper bounds on capability. METR’s modular agent framework uses this in a four-step loop: Prompter → Generator → Discriminator → Actor.
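A hedged sketch of the core loop, with `generate` and `score` as hypothetical hooks standing in for METR's Generator and Discriminator steps:

```python
# Best-of-N elicitation: sample N completions, score each with a
# discriminator, keep the best. Approximates an upper bound on what the
# model can do on this input.
def best_of_n(prompt: str, n: int, generate, score) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```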
Multi-Step Reasoning Prompting
“Let’s solve this step by step.” Variants: Chain of Thought (CoT), CoT-Self-Consistency, Tree of Thoughts (ToT), Graph of Thoughts (GoT). Reveals capabilities models possess but might not show in direct testing.
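As a concrete instance, a sketch of CoT with self-consistency, assuming a hypothetical `sample_cot` that returns a (reasoning, final_answer) pair per sample:

```python
from collections import Counter

# CoT self-consistency: sample several step-by-step reasoning traces at
# nonzero temperature and majority-vote over the final answers.
def self_consistent_answer(question: str, sample_cot, k: int = 10) -> str:
    prompt = f"{question}\nLet's solve this step by step."
    answers = [sample_cot(prompt)[1] for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```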
Supervised Fine-Tuning (SFT)
More reliable than prompting for capability elicitation: modify the weights so the model consistently exhibits the target behavior. Research using “password-locked models” (deliberately trained to perform well only when a specific password appears in the prompt) showed that fine-tuning recovers >90% of the hidden performance with just a few examples.
SFT limitations: models whose starting behavior is too far from the target may not converge; it requires examples of the target behavior; and it cannot assess capability gains that come from external tools.
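A minimal SFT loop sketch, assuming an HF-style causal LM whose forward pass returns an object with a `.loss` attribute; hyperparameters are illustrative, not taken from the password-locking research:

```python
import torch

def sft(model, demonstrations, epochs: int = 3, lr: float = 1e-5):
    # `demonstrations` is a handful of pre-tokenized (input_ids, labels)
    # pairs exhibiting the target behavior -- often enough to unlock
    # password-locked capabilities.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, labels in demonstrations:
            loss = model(input_ids=input_ids, labels=labels).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```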
Tool-Augmentation
Provide access to external tools/APIs as scaffolding. METR’s autonomous replication evaluation tests complex real-world tasks this way.
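A toy tool-use scaffold under loud assumptions: the model is prompted to emit either a JSON tool call or a plain final answer, `query_model` is hypothetical, and the tool registry is a stub. METR's actual scaffolding is considerably more elaborate:

```python
import json

TOOLS = {
    "run_shell": lambda cmd: "(stdout of the command would go here)",  # stub
}

def agent_loop(task: str, query_model, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = query_model(transcript)
        try:
            call = json.loads(reply)  # expect {"tool": ..., "args": ...}
        except json.JSONDecodeError:
            return reply  # not a tool call -> treat as the final answer
        observation = TOOLS[call["tool"]](call["args"])
        transcript += f"{reply}\nObservation: {observation}\n"
    return "(step budget exhausted)"
```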
Red Teaming
Systematic adversarial attempts to elicit concerning behaviors. Used extensively in control evaluations.
Long-Term Interaction Studies
Studies behavior over extended interactions. DeepMind’s “Hidden Agenda” evaluation studies multi-turn manipulation.
Internal Techniques
Study internal representations, activation patterns, computational mechanisms. Currently under development — most evaluations still use only behavioral techniques. Maps closely to interpretability research.
Enumerative Safety
Systematically catalog individual features (concepts) neural networks represent internally. Different from mechanistic interpretability (computational pathways) and developmental interpretability (feature formation during training).
Sparse Autoencoders (SAEs)
Decompose activations into non-overlapping interpretable components. Anthropic extracted interpretable features from Claude 3 Sonnet revealing deception, sycophancy, and bias features. Connects to SR2025 sparse-coding agenda.
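A minimal SAE sketch in PyTorch: an overcomplete ReLU encoder, a linear decoder, and an L1 penalty on the feature activations. Dimensions and coefficients are illustrative, not Anthropic's training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, mostly-zero code
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    mse = (recon - acts).pow(2).mean()             # reconstruction term
    sparsity = features.abs().sum(dim=-1).mean()   # L1 sparsity term
    return mse + l1_coeff * sparsity
```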
Representation Engineering (RepE)
Studies how models encode information internally, placing representations (rather than individual neurons or circuits) at the center of analysis.
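One common RepE-style move, sketched under assumptions: compute a concept direction as the difference of mean activations between contrastive prompt sets (e.g. honest vs. dishonest framings), then read activations along it:

```python
import numpy as np

def reading_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    # difference-of-means direction between contrastive activation sets
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def read(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # score each activation by its component along the concept direction
    return acts @ direction
```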
Probes
Linear classifiers trained on internal activations to detect specific patterns. Researchers trained “coup probes” that reliably flagged takeover-attempt patterns even when the wording was changed or translated.
Limitation: probe effectiveness depends on models not being dramatically more capable than humans in safety-critical domains. Extremely advanced systems might manipulate their internal representations to fool probes (“gradient hacking”), though this is currently considered extremely difficult.
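The probe itself is ordinary supervised learning on cached activations. A sketch with synthetic placeholder data (the real “coup probe” setup labels residual-stream activations, not random vectors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 4096))   # cached activations, one row each
y_train = rng.integers(0, 2, 1000)            # 1 = concerning pattern present

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = probe.predict_proba(X_train[:5])[:, 1]  # per-example flag probability
```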
Alignment Audits
Systematic investigations into hidden objectives. Anthropic’s “auditing game”: a Claude 3.5 Haiku variant was deliberately trained with a hidden reward-model-sycophancy objective, and four research teams investigated it. The three teams with training-data access succeeded; the one team without it failed. Training-data access is load-bearing.
The Poser benchmark: 324 pairs of LLaMA-13B models (in each pair, one aligned, one an alignment faker). Detection strategies tested (the noise-injection strategy is sketched after the list):
- Activation-noise injection (“intoxication”) → 98% detection
- Fine-tuning learning rate comparison → 90% accuracy
- Activation-direction analysis → 71% accuracy
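A sketch of the first strategy, assuming a PyTorch transformer whose layers can be hooked; the noise scale and layer choice are illustrative guesses, not the benchmark's exact settings:

```python
import torch

def add_noise_hooks(model_layers, scale: float = 0.1):
    # "Intoxicate" the model: inject Gaussian noise into intermediate
    # activations and watch whether alignment-faking behavior surfaces.
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # many transformer layers return tuples
            return (output[0] + scale * torch.randn_like(output[0]),) + output[1:]
        return output + scale * torch.randn_like(output)

    handles = [layer.register_forward_hook(hook) for layer in model_layers]
    return handles  # call h.remove() on each handle to restore the model
```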
Combined Defense in Depth
The Atlas’s stance: behavioral and internal techniques complement each other through de-correlated signals. Different failure modes detected by different methods.
Internal-techniques limitations:
- Superposition (concepts spread across overlapping combinations of neurons)
- Inherent errors in current SAEs
- Difficulty measuring progress without ground truth
- “Long tail” of complex poorly-understood behaviors
- Cannot prove absence of deceptive circuits
The goal: make deception harder and riskier for AI, not guarantee detection.
Connection to Wiki
- evaluated-properties — what these techniques evaluate
- evaluation-design — how to deploy these techniques
- interpretability — internal techniques are applied interpretability
- sparse-coding, lie-and-deception-detectors, monitoring-concepts — SR2025 agendas matching internal techniques
- various-redteams — SR2025 agenda for behavioral red-teaming
- capability-evaluations — operational deployment
Related Pages
- evaluated-properties
- evaluation-design
- evaluation-frameworks
- interpretability
- sparse-coding
- lie-and-deception-detectors
- monitoring-concepts
- various-redteams
- capability-evaluations
- deceptive-alignment
- ai-deception-evals
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-02-evaluation-techniques
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Evaluation Techniques — referenced as
[[atlas-ch5-evaluations-02-evaluation-techniques]]