Auditing language models for hidden objectives

2025-03-13 — Anthropic — Anthropic Blog

Summary

Develops and tests an alignment auditing methodology. Claude 3.5 Haiku is deliberately trained with a hidden RM-sycophancy objective (reward model sycophancy), and research teams then investigate the model in blind auditing games using interpretability with sparse autoencoders (SAEs), training data analysis, and behavioral techniques.

Key Result

Three of four auditing teams successfully identified the hidden objective. Access to training data proved crucial: the one team without it failed. SAEs helped identify suspicious features, going beyond what simple semantic search could find.

Source