Cost-Effective Constitutional Classifiers via Representation Re-use
Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, … (+3 more) — 2025 — Anthropic — Anthropic Alignment Science Blog
Summary
Develops cost-effective jailbreak detection methods that reuse the policy model's computations, using linear probes on its activations and fine-tuning of only its final layers. These monitors perform comparably to dedicated classifiers at roughly 4% computational overhead or less.
Key Result
Fine-tuning just the final layer matches the performance of a dedicated classifier with 1/4 of the policy model’s parameters at ~4% overhead, while EMA (exponential moving average) linear probes perform comparably to a dedicated classifier with 2% overhead at virtually no additional cost.
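To make the idea of an EMA linear probe concrete, here is a minimal sketch, assuming per-token activations from one layer of the policy model are already available. The function names, the smoothing factor `alpha`, the threshold, and the rule of flagging on the maximum smoothed score are illustrative assumptions, not details taken from the post.

```python
from typing import List, Sequence


def ema_probe_scores(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
    alpha: float = 0.1,
) -> List[float]:
    """Apply a linear probe to each token's activation vector and smooth
    the resulting logits with an exponential moving average (EMA).

    activations: one activation vector per token (hypothetical input).
    weights, bias: parameters of an already-trained linear probe.
    alpha: EMA smoothing factor (assumed value, not from the source).
    """
    smoothed: List[float] = []
    ema = 0.0  # running average, initialised at zero
    for token_activation in activations:
        # Linear probe: dot product of activation and weights, plus bias.
        logit = sum(a * w for a, w in zip(token_activation, weights)) + bias
        # EMA update: blend the new logit with the running average.
        ema = alpha * logit + (1.0 - alpha) * ema
        smoothed.append(ema)
    return smoothed


def flag_sequence(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
    alpha: float = 0.1,
    threshold: float = 0.0,
) -> bool:
    """Flag the sequence if any smoothed score exceeds the threshold
    (an assumed decision rule for illustration)."""
    scores = ema_probe_scores(activations, weights, bias, alpha)
    return max(scores, default=float("-inf")) > threshold
```

Because the probe only adds a dot product and a running average per token on top of activations the policy model has already computed, its extra cost is negligible, which is consistent with the "virtually no additional cost" claim above.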
Source
- Link: https://alignment.anthropic.com/2025/cheap-monitors
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability