Building and evaluating alignment auditing agents
Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein, … (+1 more) — 2025-07-24 — Anthropic — LessWrong / AI Alignment Forum
Summary
Develops three autonomous AI agents that perform alignment-auditing tasks (investigation, evaluation building, and red-teaming) and evaluates them on models with researcher-implanted concerning behaviors; the agents achieve 42% success on hidden-goal discovery, 88% success on behavioral-evaluation creation, and discover 7 of 10 implanted test behaviors.
Key Result
The investigator agent solves the Marks et al. auditing game 42% of the time when multiple agents are run in parallel; the evaluation agent successfully discriminates models with and without implanted behaviors in 88% of runs; the red-teaming agent discovers 7 of 10 implanted test behaviors.
Source
- Link: https://lesswrong.com/posts/DJAZHYjWxMrcd2na3/building-and-evaluating-alignment-auditing-agents
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- reverse-engineering — White-box safety (i.e. Interpretability)