Constitutional Classifiers: Defending against universal jailbreaks

Anthropic Safeguards Research Team — 2025-02-03 — Anthropic — Anthropic Research Blog

Summary

Anthropic announces Constitutional Classifiers, a defense method that uses input and output classifiers trained on synthetically generated data. The classifiers reduced jailbreak success rates from 86% to 4.4% while keeping over-refusal rates low. The method was validated through more than 3,000 hours of human red teaming and a public demo.
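
As a rough illustration of how an input classifier and an output classifier can sit around a model, here is a minimal sketch. Everything in it is an assumption made for illustration: the names guarded_generate, toy_model, and toy_flag are hypothetical, and the keyword check is only a stand-in for a trained classifier. It screens the completed response in one pass for simplicity and is not Anthropic's implementation.

```python
from typing import Callable

# Hypothetical refusal message; the real system's wording is not given in the source.
REFUSAL = "I can't help with that."


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_flagged: Callable[[str], bool],
    output_flagged: Callable[[str], bool],
) -> str:
    """Run the model only if the prompt passes the input classifier,
    and return its completion only if it passes the output classifier."""
    if input_flagged(prompt):
        return REFUSAL
    completion = model(prompt)
    if output_flagged(completion):
        return REFUSAL
    return completion


# Toy stand-ins (assumptions): a keyword match instead of a trained classifier,
# and an echo function instead of a language model.
def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_flag(text: str) -> bool:
    return "forbidden" in text.lower()


if __name__ == "__main__":
    print(guarded_generate("hello", toy_model, toy_flag, toy_flag))
    print(guarded_generate("a FORBIDDEN request", toy_model, toy_flag, toy_flag))
```

Passing the two classifiers in as separate callables keeps the input and output checks independent, so either one can be replaced or tuned without touching the other.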

Key Result

Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4%, with only a 0.38% increase in refusal rates and a 23.7% compute overhead. No universal jailbreaks were found during the initial red-teaming period of more than 3,000 hours, though the live demo was eventually jailbroken after significant effort.
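
To put the two success rates in perspective, the short calculation below derives the absolute and relative reduction from the figures quoted above. The function name jailbreak_reduction is hypothetical, and the derived percentages are simple arithmetic on the reported numbers, not additional results from the source.

```python
def jailbreak_reduction(baseline: float, guarded: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction in percent)
    for two success rates given as percentages."""
    absolute_drop = baseline - guarded
    relative_reduction = absolute_drop / baseline * 100
    return absolute_drop, relative_reduction


if __name__ == "__main__":
    # 86% without the classifiers, 4.4% with them (figures from the text above).
    drop, relative = jailbreak_reduction(86.0, 4.4)
    print(f"absolute drop: {drop:.1f} percentage points")  # 81.6
    print(f"relative reduction: {relative:.1f}%")  # 94.9
```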

Source