AI Safety Atlas Ch.5 — Evaluation Techniques

Source: Evaluation Techniques

Two complementary technique families: behavioral (analyze outputs, “black-box”) and internal (examine activations and computations, “gray-box”). Combined, they build stronger safety evidence than either alone. See evaluation-techniques.

Behavioral Techniques

Standard Prompting and Testing

Most basic form — predefined inputs, analyze outputs. Even “simple” testing requires careful question framing.

Elicitation and Scaffolding

Best-of-N Sampling — generate multiple responses, select the best per scoring criteria. Establishes upper bounds on capability. METR’s modular agent framework uses a four-step loop: Prompter → Generator → Discriminator → Actor.

Multi-Step Reasoning Prompting — “Let’s solve this step by step.” Variants: Chain of Thought (CoT), CoT-Self-Consistency, Tree of Thoughts, Graph of Thoughts. Reveals dormant capabilities.

Supervised Fine-Tuning — modify weights to consistently exhibit specific behaviors. Research using “password-locked models” showed fine-tuning recovers >90% of hidden performance with just a few examples. More reliable than prompting for capability elicitation.

Tool-Augmentation — give models access to external tools/APIs. Reveals capabilities through scaffolding.

Red Teaming — systematic attempts to elicit concerning behaviors. Used extensively in control evaluations.

Long-Term Interaction Studies — model behavior over extended interactions. DeepMind’s “Hidden Agenda” evaluation studies multi-turn manipulation.

Internal Techniques

Examine internal representations, activation patterns, computational mechanisms. Currently under development — most evaluations still use only behavioral techniques.

Enumerative Safety

Systematically catalog individual features (concepts) that neural networks represent internally. Differs from mechanistic interpretability (which traces computational pathways).

Sparse Autoencoders (SAEs)

Decompose activations into non-overlapping interpretable components. Anthropic extracted interpretable features from Claude 3 Sonnet revealing deception, sycophancy, and bias features. Used in alignment auditing to identify hidden objectives.

Representation Analysis (RepE)

Studies how models encode information internally. Places representations at analysis center (vs. neurons or circuits in mech interp).

Probes

Linear classifiers trained on internal activations to detect specific patterns. Researchers trained “coup probes” reliably flagging takeover-attempt patterns even with different wording or translation.

Probe limitation: depends on models not being dramatically more capable than humans in safety-critical domains. Extremely advanced systems might manipulate internal representations to fool probes (“gradient hacking”) — currently extremely difficult.

Alignment Audits

Systematic investigations into whether AI models pursue hidden objectives.

Anthropic’s “auditing game”: deliberately trained Claude 3.5 Haiku as RM-sycophant exploiting reward model biases. Four research teams investigated. Three teams with training-data access succeeded; one team without data access failed. Key insight: training-data access is load-bearing for hidden-objective detection.

The Poser benchmark — 324 LLaMA-13B model pairs, one aligned and one alignment-faker. Tested detection strategies:

“Intoxicating” via activation noise → 98% detection
Fine-tuning learning-rate comparison → 90% accuracy
Activation-direction analysis → 71% accuracy

Combined Defense in Depth

The Atlas’s stance: internal techniques complement behavioral methods — they don’t replace them. “De-correlated signals” improve detection vs. either alone.

Fundamental limitations of internal techniques:

Superposition (concepts distributed across neurons)
Inherent errors in current SAEs
Difficulty measuring progress without ground truth
“Long tail” of complex poorly-understood behaviors
Can’t prove absence of deceptive circuits

The goal: make deception harder and riskier for AI, not guarantee detection.

Connection to Wiki

evaluation-techniques — dedicated concept page
interpretability — internal-techniques umbrella
sparse-coding — SR2025 agenda for SAEs
lie-and-deception-detectors — SR2025 agenda
monitoring-concepts — SR2025 agenda
deceptive-alignment, ai-deception-evals, ai-scheming-evals — what these techniques try to detect
atlas-ch5-evaluations-03-evaluated-properties

AI Safety Compendium

Explorer

AI Safety Atlas Ch.5 — Evaluation Techniques

AI Safety Atlas Ch.5 — Evaluation Techniques

Behavioral Techniques

Standard Prompting and Testing

Elicitation and Scaffolding

Internal Techniques

Enumerative Safety

Sparse Autoencoders (SAEs)

Representation Analysis (RepE)

Probes

Alignment Audits

Combined Defense in Depth

Connection to Wiki

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

AI Safety Atlas Ch.5 — Evaluation Techniques

AI Safety Atlas Ch.5 — Evaluation Techniques

Behavioral Techniques

Standard Prompting and Testing

Elicitation and Scaffolding

Internal Techniques

Enumerative Safety

Sparse Autoencoders (SAEs)

Representation Analysis (RepE)

Probes

Alignment Audits

Combined Defense in Depth

Connection to Wiki

Related Pages

Graph View

Graph view

Table of Contents

Backlinks