Constitutional Classifiers: Defending against universal jailbreaks
Anthropic Safeguards Research Team — 2025-02-03 — Anthropic Research Blog
Summary
Anthropic announces Constitutional Classifiers, a defense method using synthetically trained input and output classifiers that reduced jailbreak success rates from 86% to 4.4% while keeping overrefusal rates low, validated through more than 3,000 hours of human red teaming and a public demo.
Key Result
Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4%, with only a 0.38% increase in refusal rates and a 23.7% compute overhead; no universal jailbreaks were found during the initial 3,000+ hour red-teaming period, though the live demo was eventually jailbroken after significant effort.
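The gating pattern described above, screening prompts before generation and completions after, can be sketched as follows. This is a minimal illustration only: the keyword matchers and the `generate` stub are hypothetical stand-ins, not Anthropic's constitution-trained classifiers or model.

```python
# Sketch of input/output classifier gating. The classifiers here are
# trivial keyword matchers standing in for the synthetically trained
# classifiers; `generate` stands in for the underlying model call.

HARMFUL_MARKERS = ["synthesize nerve agent"]  # illustrative stand-in policy

def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be refused before generation."""
    return any(m in prompt.lower() for m in HARMFUL_MARKERS)

def output_classifier(completion: str) -> bool:
    """Return True if the completion should be halted before delivery."""
    return any(m in completion.lower() for m in HARMFUL_MARKERS)

def generate(prompt: str) -> str:
    # Placeholder for the model; echoes the prompt for demonstration.
    return f"Response to: {prompt}"

def guarded_generate(prompt: str) -> str:
    """Wrap the model with input- and output-side classifier checks."""
    if input_classifier(prompt):
        return "[refused by input classifier]"
    completion = generate(prompt)
    if output_classifier(completion):
        return "[halted by output classifier]"
    return completion
```

In a streaming deployment the output classifier would score the completion incrementally and cut generation mid-stream; the post-hoc check here is a simplification.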
Source
- Link: https://www.anthropic.com/research/constitutional-classifiers
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)