Building and evaluating alignment auditing agents
Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, … (+2 more) — 2025-07-24 — Anthropic — Anthropic Alignment Science Blog
Summary
Develops and evaluates three AI agents that autonomously conduct alignment auditing tasks: an investigator agent using interpretability tools, an evaluation agent that builds behavioral tests, and a breadth-first red-teaming agent. The agents are tested on models with intentionally inserted alignment issues and are also applied to auditing Claude 4.
Key Result
The investigator agent solves the Marks et al. auditing game 13% of the time (42% with super-agent aggregation), the evaluation agent discriminates quirky from baseline models in 88% of runs, and the breadth-first red-teaming agent discovers 7 of 10 implanted test behaviors.
Source
- Link: https://alignment.anthropic.com/2025/automated-auditing/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals