Building and evaluating alignment auditing agents

Trenton Bricken, Rowan Wang, Sam Bowman, Euan Ong, Johannes Treutlein, Jeff Wu, … (+2 more) — 2025-07-24 — Anthropic — Anthropic Alignment Science Blog

Summary

Develops and evaluates three AI agents that autonomously carry out alignment auditing tasks: an investigator agent that uses interpretability tools, an evaluation agent that builds behavioral tests, and a breadth-first red-teaming agent. The agents are tested on models with intentionally inserted alignment issues and are also applied to auditing Claude 4.

Key Result

The investigator agent solves the Marks et al. auditing game 13% of the time (42% with super-agent aggregation), the evaluation agent successfully discriminates quirky from baseline models in 88% of runs, and the breadth-first red-teaming agent discovers 7 of 10 implanted test behaviors.

Source

Link: https://alignment.anthropic.com/2025/automated-auditing/
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals
