Cost-Effective Constitutional Classifiers via Representation Re-use

Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, … (+3 more) — 2025 — Anthropic — Anthropic Alignment Science Blog

Summary

Develops cost-effective jailbreak detection methods that reuse the policy model's own computations, either through linear probes on its activations or through partial fine-tuning of its final layers. The resulting detectors perform comparably to dedicated classifiers at roughly 2-4% computational overhead.

Key Result

Fine-tuning just the final layer of the policy model matches the performance of a dedicated classifier with 1/4 of the policy model’s parameters, at ~4% cost. EMA linear probes perform comparably to a classifier with 2% overhead, at virtually no additional cost.
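
To make the probe idea concrete, here is a minimal sketch of how an EMA linear probe could score a sequence. It assumes that "EMA" means an exponential moving average of per-token probe outputs and that the probe is a logistic-regression head on one layer's activations; the summary confirms neither detail. The names `ema_probe_scores` and `is_flagged`, the smoothing factor `alpha`, and the `threshold` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def ema_probe_scores(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    alpha: float = 0.5,  # assumed smoothing factor, not taken from the source
) -> List[float]:
    """Apply a linear probe to each token's activation vector and smooth
    the resulting probabilities with an exponential moving average."""
    scores: List[float] = []
    ema = None
    for act in activations:
        logit = sum(a * w for a, w in zip(act, weights)) + bias
        p = sigmoid(logit)
        # Initialise the average with the first token's probability.
        ema = p if ema is None else alpha * p + (1.0 - alpha) * ema
        scores.append(ema)
    return scores


def is_flagged(scores: Sequence[float], threshold: float = 0.8) -> bool:
    """Flag the sequence if the smoothed score ever reaches the threshold."""
    return bool(scores) and max(scores) >= threshold


# Toy usage with made-up activations: one neutral token, one strongly positive.
example_scores = ema_probe_scores([[0.0], [100.0]], weights=[1.0], bias=0.0)
print(example_scores, is_flagged(example_scores, threshold=0.7))
```

In this sketch the only added work per token is one dot product against activations the policy model has already computed, which is the sense in which a probe can run at almost no additional cost.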

Source