Constitutional Classifiers: Defending against universal jailbreaks
Anthropic Safeguards Research Team — 2025-02-03 — Anthropic Research Blog
Summary
Anthropic announces Constitutional Classifiers, a defense method using synthetically trained input and output classifiers that reduced jailbreak success rates from 86% to 4.4% while keeping overrefusal rates low, validated through more than 3,000 hours of human red teaming and a public demo.
Key Result
Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4%, with only a 0.38% increase in refusal rates and a 23.7% compute overhead; no universal jailbreaks were found during the initial 3,000+ hour red-teaming period, though the live demo was eventually jailbroken after significant effort.
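The gating pattern described above, screening prompts before generation and completions after, can be sketched as follows. This is a minimal illustration only: the keyword matchers and the `generate` stub are hypothetical stand-ins, not Anthropic's constitution-trained classifiers or model.

```python
# Sketch of input/output classifier gating. The classifiers here are
# trivial keyword matchers standing in for the synthetically trained
# classifiers; `generate` stands in for the underlying model call.

HARMFUL_MARKERS = ["synthesize nerve agent"]  # illustrative stand-in policy

def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be refused before generation."""
    return any(m in prompt.lower() for m in HARMFUL_MARKERS)

def output_classifier(completion: str) -> bool:
    """Return True if the completion should be halted before delivery."""
    return any(m in completion.lower() for m in HARMFUL_MARKERS)

def generate(prompt: str) -> str:
    # Placeholder for the model; echoes the prompt for demonstration.
    return f"Response to: {prompt}"

def guarded_generate(prompt: str) -> str:
    """Wrap the model with input- and output-side classifier checks."""
    if input_classifier(prompt):
        return "[refused by input classifier]"
    completion = generate(prompt)
    if output_classifier(completion):
        return "[halted by output classifier]"
    return completion
```

In a streaming deployment the output classifier would score the completion incrementally and cut generation mid-stream; the post-hoc check here is a simplification.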
Source
- Link: https://www.anthropic.com/research/constitutional-classifiers
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)