Auditing language models for hidden objectives
2025-03-13 — Anthropic — Anthropic Blog
Summary
Develops and tests an alignment-auditing methodology by deliberately training Claude 3.5 Haiku with a hidden RM-sycophancy (reward-model sycophancy) objective, then running blind auditing games in which research teams investigate the model using interpretability (sparse autoencoders, SAEs), training-data analysis, and behavioral techniques.
Key Result
Three of the four auditing teams identified the hidden objective. Training-data access proved crucial (the one team without it failed), and SAEs surfaced suspicious features that simple semantic search would have missed.
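As a minimal illustration of the SAE-based step, one might rank SAE features by how much more strongly they activate on sycophancy-probe prompts than on neutral controls, then inspect the top-ranked candidates by hand. This sketch is entirely hypothetical: the shapes, data, and the planted feature are illustrative assumptions, not details from the study.

```python
import numpy as np

# Hypothetical sketch: SAE feature activations on two prompt sets.
# Shapes and values are illustrative, not from the actual experiment.
rng = np.random.default_rng(0)
n_features = 8
probe_acts = rng.random((16, n_features))    # activations on RM-sycophancy probe prompts
neutral_acts = rng.random((16, n_features))  # activations on neutral control prompts

# Plant a "suspicious" feature that fires strongly on probe prompts.
probe_acts[:, 3] += 2.0

# Score each feature by its mean activation gap; large positive gaps flag
# candidates for manual inspection (e.g. reading top-activating examples).
gap = probe_acts.mean(axis=0) - neutral_acts.mean(axis=0)
ranked = np.argsort(gap)[::-1]
print(ranked[0])  # the planted feature (index 3) tops the ranking
```

In the actual auditing games this kind of activation-gap triage would only narrow the search; confirming a hidden objective still required cross-checking against training data and behavior.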
Source
- Link: https://www.anthropic.com/research/auditing-hidden-objectives
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - anthropic — Labs (giant companies)
  - extracting-latent-knowledge — White-box safety (i.e. Interpretability)