Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, … (+37 more) — 2025-01-31 — Anthropic — arXiv
Summary
Introduces Constitutional Classifiers: safeguards trained on synthetic data generated from natural-language constitutions, designed to defend against universal jailbreaks. The paper evaluates them through 3,000+ hours of red teaming and reports robust defense while maintaining deployment viability.
Key Result
During 3,000+ hours of testing, no red teamer found a universal jailbreak that could extract information at a level of detail similar to that of an unguarded model across target queries.
Source
- Link: https://arxiv.org/abs/2501.18837
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- safeguards-inference-time-auxiliaries — Black-box safety (understand and control current model behaviour) / Iterative alignment