Building and evaluating alignment auditing agents

Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, … (+2 more) — 2025-07-24 — Anthropic — Anthropic Alignment Science Blog

Summary

Develops and evaluates three AI agents that autonomously carry out alignment auditing tasks: an investigator agent that uses interpretability tools, an evaluation agent that builds behavioral tests, and a breadth-first red-teaming agent. The agents are tested on models with intentionally inserted alignment issues and are also applied to the auditing of Claude 4.

Key Result

The investigator agent solves the Marks et al. auditing game 13% of the time (42% with super-agent aggregation), the evaluation agent successfully distinguishes quirky models from baseline models in 88% of runs, and the breadth-first red-teaming agent discovers 7 of 10 implanted test behaviors.

Source