Auditing language models for hidden objectives
2025-03-13 — Anthropic — Anthropic Blog
Summary
Develops and tests an alignment-auditing methodology by deliberately training Claude 3.5 Haiku with a hidden RM-sycophancy (reward-model sycophancy) objective, then running blind auditing games in which research teams investigate the model using interpretability (sparse autoencoders, SAEs), training-data analysis, and behavioral techniques.
Key Result
Three of the four auditing teams identified the hidden objective. Training-data access proved crucial (the one team without it failed), and SAEs surfaced suspicious features that simple semantic search would have missed.
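As a minimal illustration of the SAE-based step, one might rank SAE features by how much more strongly they activate on sycophancy-probe prompts than on neutral controls, then inspect the top-ranked candidates by hand. This sketch is entirely hypothetical: the shapes, data, and the planted feature are illustrative assumptions, not details from the study.

```python
import numpy as np

# Hypothetical sketch: SAE feature activations on two prompt sets.
# Shapes and values are illustrative, not from the actual experiment.
rng = np.random.default_rng(0)
n_features = 8
probe_acts = rng.random((16, n_features))    # activations on RM-sycophancy probe prompts
neutral_acts = rng.random((16, n_features))  # activations on neutral control prompts

# Plant a "suspicious" feature that fires strongly on probe prompts.
probe_acts[:, 3] += 2.0

# Score each feature by its mean activation gap; large positive gaps flag
# candidates for manual inspection (e.g. reading top-activating examples).
gap = probe_acts.mean(axis=0) - neutral_acts.mean(axis=0)
ranked = np.argsort(gap)[::-1]
print(ranked[0])  # the planted feature (index 3) tops the ranking
```

In the actual auditing games this kind of activation-gap triage would only narrow the search; confirming a hidden objective still required cross-checking against training data and behavior.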
Source
- Link: https://www.anthropic.com/research/auditing-hidden-objectives
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - anthropic — Labs (giant companies)
  - extracting-latent-knowledge — White-box safety (i.e. Interpretability)