Building and evaluating alignment auditing agents

Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein, … (+1 more) — 2025-07-24 — Anthropic — LessWrong / AI Alignment Forum

Summary

Develops three autonomous AI agents that perform alignment auditing tasks (investigation, evaluation building, and red-teaming) and evaluates their performance on models with researcher-implanted concerning behaviors. The agents achieve a 42% success rate on hidden-goal discovery and an 88% success rate on behavioral-evaluation creation, and discover 7 of 10 implanted behaviors.

Key Result

The investigator agent solves the Marks et al. auditing game 42% of the time when multiple agents are run in parallel; the evaluation agent successfully discriminates models with vs. without implanted behaviors in 88% of runs; and the red-teaming agent discovers 7 of 10 implanted test behaviors.

Source