Evaluation Techniques
Two complementary technique families for AI safety evaluation: behavioral (analyze outputs, “black-box”) and internal (examine activations and computations, “gray-box”). Together they build stronger safety evidence than either alone.
The AI Safety Atlas (Ch.5.2) treats this as the technical menu for evaluating capabilities, propensities, and control.
Behavioral Techniques
Examine AI systems through observable outputs. Also called black-box or input-output (IO) evaluations.
Standard Prompting and Testing
Feed the model predefined inputs and analyze its outputs. The baseline technique. Even simple testing requires careful question framing.
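A minimal sketch of what such a harness looks like, assuming a hypothetical `query_model` API call and naive containment grading (real evals need more careful scoring):

```python
# Minimal behavioral (black-box) eval loop. `query_model` is a hypothetical
# stand-in for whatever model API your stack exposes.
def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with your model API call")

EVAL_ITEMS = [
    {"prompt": "What is 17 * 23?", "expected": "391"},
    {"prompt": "Name the capital of France.", "expected": "Paris"},
]

def run_eval(items) -> float:
    correct = 0
    for item in items:
        output = query_model(item["prompt"])
        if item["expected"] in output:  # crude containment check, not robust grading
            correct += 1
    return correct / len(items)
```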
Elicitation and Scaffolding
Actively draw out capabilities that are not apparent from naive prompting.
Best-of-N Sampling
Generate N completions, select best. Establishes upper bounds on capability. METR’s modular agent framework uses this in a four-step loop: Prompter → Generator → Discriminator → Actor.
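A hedged sketch of the core loop, with `generate` and `score` as hypothetical hooks standing in for METR's Generator and Discriminator steps:

```python
# Best-of-N elicitation: sample N completions, score each with a
# discriminator, keep the best. Approximates an upper bound on what the
# model can do on this input.
def best_of_n(prompt: str, n: int, generate, score) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```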
Multi-Step Reasoning Prompting
“Let’s solve this step by step.” Variants: Chain of Thought (CoT), CoT-Self-Consistency, Tree of Thoughts (ToT), Graph of Thoughts (GoT). Reveals capabilities models possess but might not show in direct testing.
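As a concrete instance, a sketch of CoT with self-consistency, assuming a hypothetical `sample_cot` that returns a (reasoning, final_answer) pair per sample:

```python
from collections import Counter

# CoT self-consistency: sample several step-by-step reasoning traces at
# nonzero temperature and majority-vote over the final answers.
def self_consistent_answer(question: str, sample_cot, k: int = 10) -> str:
    prompt = f"{question}\nLet's solve this step by step."
    answers = [sample_cot(prompt)[1] for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```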
Supervised Fine-Tuning (SFT)
More reliable than prompting for capability elicitation: modify the weights so the model consistently exhibits the target behavior. Research using “password-locked models” (deliberately trained to perform well only when a specific password appears in the prompt) showed that fine-tuning recovers >90% of the hidden performance with just a few examples.
SFT limitations: models whose starting behavior is too far from the target may not converge; it requires examples of the target behavior; and it cannot assess capability gains that come from external tools.
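A minimal SFT loop sketch, assuming an HF-style causal LM whose forward pass returns an object with a `.loss` attribute; hyperparameters are illustrative, not taken from the password-locking research:

```python
import torch

def sft(model, demonstrations, epochs: int = 3, lr: float = 1e-5):
    # `demonstrations` is a handful of pre-tokenized (input_ids, labels)
    # pairs exhibiting the target behavior -- often enough to unlock
    # password-locked capabilities.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, labels in demonstrations:
            loss = model(input_ids=input_ids, labels=labels).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```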
Tool-Augmentation
Provide access to external tools/APIs as scaffolding. METR’s autonomous replication evaluation tests complex real-world tasks this way.
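A toy tool-use scaffold under loud assumptions: the model is prompted to emit either a JSON tool call or a plain final answer, `query_model` is hypothetical, and the tool registry is a stub. METR's actual scaffolding is considerably more elaborate:

```python
import json

TOOLS = {
    "run_shell": lambda cmd: "(stdout of the command would go here)",  # stub
}

def agent_loop(task: str, query_model, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = query_model(transcript)
        try:
            call = json.loads(reply)  # expect {"tool": ..., "args": ...}
        except json.JSONDecodeError:
            return reply  # not a tool call -> treat as the final answer
        observation = TOOLS[call["tool"]](call["args"])
        transcript += f"{reply}\nObservation: {observation}\n"
    return "(step budget exhausted)"
```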
Red Teaming
Systematic adversarial attempts to elicit concerning behaviors. Used extensively in control evaluations.
Long-Term Interaction Studies
Studies behavior over extended interactions. DeepMind’s “Hidden Agenda” evaluation studies multi-turn manipulation.
Internal Techniques
Study internal representations, activation patterns, computational mechanisms. Currently under development — most evaluations still use only behavioral techniques. Maps closely to interpretability research.
Enumerative Safety
Systematically catalog individual features (concepts) neural networks represent internally. Different from mechanistic interpretability (computational pathways) and developmental interpretability (feature formation during training).
Sparse Autoencoders (SAEs)
Decompose activations into non-overlapping interpretable components. Anthropic extracted interpretable features from Claude 3 Sonnet revealing deception, sycophancy, and bias features. Connects to SR2025 sparse-coding agenda.
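A minimal SAE sketch in PyTorch: an overcomplete ReLU encoder, a linear decoder, and an L1 penalty on the feature activations. Dimensions and coefficients are illustrative, not Anthropic's training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # d_features >> d_model
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, mostly-zero code
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff: float = 1e-3):
    mse = (recon - acts).pow(2).mean()             # reconstruction term
    sparsity = features.abs().sum(dim=-1).mean()   # L1 sparsity term
    return mse + l1_coeff * sparsity
```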
Representation Engineering (RepE)
Studies how models encode information internally, placing representations (rather than individual neurons or circuits) at the center of analysis.
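One common RepE-style move, sketched under assumptions: compute a concept direction as the difference of mean activations between contrastive prompt sets (e.g. honest vs. dishonest framings), then read activations along it:

```python
import numpy as np

def reading_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    # difference-of-means direction between contrastive activation sets
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def read(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # score each activation by its component along the concept direction
    return acts @ direction
```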
Probes
Linear classifiers trained on internal activations to detect specific patterns. Researchers trained “coup probes” that reliably flagged takeover-attempt patterns even when the wording was changed or translated.
Limitation: probe effectiveness depends on models not being dramatically more capable than humans in safety-critical domains. Extremely advanced systems might manipulate their internal representations to fool probes (“gradient hacking”), though this is currently considered extremely difficult.
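The probe itself is ordinary supervised learning on cached activations. A sketch with synthetic placeholder data (the real “coup probe” setup labels residual-stream activations, not random vectors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 4096))   # cached activations, one row each
y_train = rng.integers(0, 2, 1000)            # 1 = concerning pattern present

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = probe.predict_proba(X_train[:5])[:, 1]  # per-example flag probability
```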
Alignment Audits
Systematic investigations into hidden objectives. Anthropic’s “auditing game”: a Claude 3.5 Haiku variant was deliberately trained with a hidden reward-model-sycophancy objective, and four research teams investigated it. The three teams with training-data access succeeded; the one team without it failed. Training-data access is load-bearing.
The Poser benchmark: 324 pairs of LLaMA-13B models (in each pair, one aligned, one an alignment faker). Detection strategies tested (the noise-injection strategy is sketched after the list):
- Activation-noise injection (“intoxication”) → 98% detection
- Fine-tuning learning rate comparison → 90% accuracy
- Activation-direction analysis → 71% accuracy
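A sketch of the first strategy, assuming a PyTorch transformer whose layers can be hooked; the noise scale and layer choice are illustrative guesses, not the benchmark's exact settings:

```python
import torch

def add_noise_hooks(model_layers, scale: float = 0.1):
    # "Intoxicate" the model: inject Gaussian noise into intermediate
    # activations and watch whether alignment-faking behavior surfaces.
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # many transformer layers return tuples
            return (output[0] + scale * torch.randn_like(output[0]),) + output[1:]
        return output + scale * torch.randn_like(output)

    handles = [layer.register_forward_hook(hook) for layer in model_layers]
    return handles  # call h.remove() on each handle to restore the model
```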
Combined Defense in Depth
The Atlas’s stance: behavioral and internal techniques complement each other through de-correlated signals. Different failure modes detected by different methods.
Internal-techniques limitations:
- Superposition (concepts spread across overlapping combinations of neurons)
- Inherent errors in current SAEs
- Difficulty measuring progress without ground truth
- “Long tail” of complex poorly-understood behaviors
- Cannot prove absence of deceptive circuits
The goal: make deception harder and riskier for AI, not guarantee detection.
Connection to Wiki
- evaluated-properties — what these techniques evaluate
- evaluation-design — how to deploy these techniques
- interpretability — internal techniques are applied interpretability
- sparse-coding, lie-and-deception-detectors, monitoring-concepts — SR2025 agendas matching internal techniques
- various-redteams — SR2025 agenda for behavioral red-teaming
- capability-evaluations — operational deployment
Related Pages
- evaluated-properties
- evaluation-design
- evaluation-frameworks
- interpretability
- sparse-coding
- lie-and-deception-detectors
- monitoring-concepts
- various-redteams
- capability-evaluations
- deceptive-alignment
- ai-deception-evals
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-02-evaluation-techniques
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Evaluation Techniques — referenced as
[[atlas-ch5-evaluations-02-evaluation-techniques]]