Building and evaluating alignment auditing agents
Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein, … (+1 more) — 2025-07-24 — Anthropic — LessWrong / AI Alignment Forum
Summary
Develops three autonomous AI agents that perform alignment-auditing tasks (investigation, evaluation building, and red-teaming) and evaluates them on models with researcher-implanted concerning behaviors; the agents achieve 42% success on hidden-goal discovery, 88% success on behavioral-evaluation creation, and discover 7 of 10 implanted test behaviors.
Key Result
The investigator agent solves the Marks et al. auditing game 42% of the time when multiple agents are run in parallel; the evaluation agent successfully discriminates models with and without implanted behaviors in 88% of runs; the red-teaming agent discovers 7 of 10 implanted test behaviors.
Source
- Link: https://lesswrong.com/posts/DJAZHYjWxMrcd2na3/building-and-evaluating-alignment-auditing-agents
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- reverse-engineering — White-box safety (i.e. Interpretability)