Cost-Effective Constitutional Classifiers via Representation Re-use

Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, … (+3 more) — 2025 — Anthropic — Anthropic Alignment Science Blog

Summary

Develops cost-effective jailbreak detection methods that reuse the policy model's own computations, using linear probes on its activations and partial fine-tuning of its final layers. These methods perform comparably to dedicated classifiers while adding at most about 4% computational overhead.

Key Result

Fine-tuning just the final layer of the policy model matches the performance of a dedicated classifier with 1/4 of the policy model’s parameters, at ~4% additional cost. Exponential moving average (EMA) linear probes perform comparably to a dedicated classifier that adds 2% overhead, at virtually no additional cost.
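To make the probe idea concrete, here is a minimal sketch of an EMA linear probe over per-token activations. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to smooth logits rather than probabilities, the max-over-tokens aggregation, and the values of `alpha` and `threshold` are all hypothetical, and the probe weights are assumed to have been trained already.

```python
import math
from typing import List, Sequence


def probe_logit(activation: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear probe: dot product of one token's activation with a learned direction."""
    return sum(a * w for a, w in zip(activation, weights)) + bias


def ema_probe_scores(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    alpha: float = 0.5,  # assumed smoothing factor, not taken from the source
) -> List[float]:
    """Return one smoothed harmfulness probability per token position."""
    scores: List[float] = []
    ema = None
    for activation in activations:
        logit = probe_logit(activation, weights, bias)
        # Exponential moving average over token positions (assumption: applied to logits).
        ema = logit if ema is None else alpha * logit + (1.0 - alpha) * ema
        scores.append(1.0 / (1.0 + math.exp(-ema)))
    return scores


def flag_sequence(
    activations: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    alpha: float = 0.5,
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Flag the sequence if any smoothed score exceeds the threshold (assumed rule)."""
    scores = ema_probe_scores(activations, weights, bias, alpha)
    return max(scores, default=0.0) > threshold
```

Because the activations are already produced by the policy model's forward pass, the only extra work in this sketch is one dot product and one scalar update per token, which is the sense in which such a probe adds almost no cost.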

Source

Link: https://alignment.anthropic.com/2025/cheap-monitors
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
