Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, … (+37 more) — 2025-01-31 — Anthropic — arXiv

Summary

Introduces Constitutional Classifiers: safeguards trained on synthetic data generated from natural-language constitutions, designed to defend against universal jailbreaks. Evaluates them through 3,000+ hours of red teaming, reporting a robust defense that remains viable for deployment.

Key Result

During 3,000+ hours of red teaming, no red teamer found a universal jailbreak that could extract information from the guarded model at a level of detail similar to an unguarded model across the target queries.

Source