Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, … (+37 more) — 2025-01-31 — Anthropic — arXiv

Summary

Introduces Constitutional Classifiers: safeguards trained on synthetic data generated from natural-language constitutions that specify permitted and restricted content, designed to defend against universal jailbreaks. Evaluated through more than 3,000 hours of red teaming, the classifiers show robust defense while remaining viable to deploy.

Key Result

Across more than 3,000 hours of red teaming, no red teamer found a universal jailbreak that could extract information from the guarded model at a level of detail similar to an unguarded model across most target queries.

Source

Link: https://arxiv.org/abs/2501.18837
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- safeguards-inference-time-auxiliaries — Black-box safety (understand and control current model behaviour) / Iterative alignment
