AI Safety Atlas Ch.5 — Evaluation Techniques

Source: Evaluation Techniques

Two complementary technique families: behavioral (analyze outputs, “black-box”) and internal (examine activations and computations, “gray-box”). Combined, they build stronger safety evidence than either alone. See evaluation-techniques.

Behavioral Techniques

Standard Prompting and Testing

Most basic form — predefined inputs, analyze outputs. Even “simple” testing requires careful question framing.

Elicitation and Scaffolding

Best-of-N Sampling — generate multiple responses, select the best per scoring criteria. Establishes upper bounds on capability. METR’s modular agent framework uses a four-step loop: Prompter → Generator → Discriminator → Actor.

Multi-Step Reasoning Prompting — “Let’s solve this step by step.” Variants: Chain of Thought (CoT), CoT-Self-Consistency, Tree of Thoughts, Graph of Thoughts. Reveals dormant capabilities.

Supervised Fine-Tuning — modify weights to consistently exhibit specific behaviors. Research using “password-locked models” showed fine-tuning recovers >90% of hidden performance with just a few examples. More reliable than prompting for capability elicitation.

Tool-Augmentation — give models access to external tools/APIs. Reveals capabilities through scaffolding.

Red Teaming — systematic attempts to elicit concerning behaviors. Used extensively in control evaluations.

Long-Term Interaction Studies — model behavior over extended interactions. DeepMind’s “Hidden Agenda” evaluation studies multi-turn manipulation.

Internal Techniques

Examine internal representations, activation patterns, computational mechanisms. Currently under development — most evaluations still use only behavioral techniques.

Enumerative Safety

Systematically catalog individual features (concepts) that neural networks represent internally. Differs from mechanistic interpretability (which traces computational pathways).

Sparse Autoencoders (SAEs)

Decompose activations into non-overlapping interpretable components. Anthropic extracted interpretable features from Claude 3 Sonnet revealing deception, sycophancy, and bias features. Used in alignment auditing to identify hidden objectives.

Representation Analysis (RepE)

Studies how models encode information internally. Places representations at analysis center (vs. neurons or circuits in mech interp).

Probes

Linear classifiers trained on internal activations to detect specific patterns. Researchers trained “coup probes” reliably flagging takeover-attempt patterns even with different wording or translation.

Probe limitation: depends on models not being dramatically more capable than humans in safety-critical domains. Extremely advanced systems might manipulate internal representations to fool probes (“gradient hacking”) — currently extremely difficult.

Alignment Audits

Systematic investigations into whether AI models pursue hidden objectives.

Anthropic’s “auditing game”: deliberately trained Claude 3.5 Haiku as RM-sycophant exploiting reward model biases. Four research teams investigated. Three teams with training-data access succeeded; one team without data access failed. Key insight: training-data access is load-bearing for hidden-objective detection.

The Poser benchmark — 324 LLaMA-13B model pairs, one aligned and one alignment-faker. Tested detection strategies:

  • “Intoxicating” via activation noise → 98% detection
  • Fine-tuning learning-rate comparison → 90% accuracy
  • Activation-direction analysis → 71% accuracy

Combined Defense in Depth

The Atlas’s stance: internal techniques complement behavioral methods — they don’t replace them. “De-correlated signals” improve detection vs. either alone.

Fundamental limitations of internal techniques:

  • Superposition (concepts distributed across neurons)
  • Inherent errors in current SAEs
  • Difficulty measuring progress without ground truth
  • “Long tail” of complex poorly-understood behaviors
  • Can’t prove absence of deceptive circuits

The goal: make deception harder and riskier for AI, not guarantee detection.

Connection to Wiki