Deliberative Alignment: Reasoning Enables Safer Language Models
Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, … (+9 more) — 2024-12-20 — OpenAI — arXiv
Summary
Introduces Deliberative Alignment, an alignment paradigm that directly teaches models explicit safety specifications and trains them to reason over those specifications before answering, and applies it to OpenAI’s o-series models.
Key Result
Achieved highly precise adherence to OpenAI’s safety policies while increasing robustness to jailbreaks and decreasing overrefusal rates, without requiring human-written chain-of-thought examples.
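As a rough illustration of the inference-time pattern the paper describes (the model reasons over an in-context safety specification in its chain of thought, and only the final answer is surfaced), here is a minimal sketch. The spec text, the `<reasoning>`/`<answer>` delimiters, and the `generate` stub are hypothetical placeholders, not OpenAI's actual spec, format, or API.

```python
# Sketch of deliberative-alignment-style inference: the safety specification
# is placed in context, the model reasons over it in a (hidden) chain of
# thought, and only the final answer is returned to the user.
# SAFETY_SPEC, the tag format, and generate() are illustrative stand-ins.

SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious harm.
2. Answer benign requests helpfully; do not overrefuse.
"""

PROMPT_TEMPLATE = (
    "Safety specification:\n{spec}\n"
    "User: {user}\n"
    "First reason about which spec clauses apply inside "
    "<reasoning>...</reasoning>, then reply inside <answer>...</answer>."
)


def generate(prompt: str) -> str:
    # Stand-in for sampling from a reasoning model.
    return (
        "<reasoning>The request is benign; clause 2 applies, so I should "
        "answer helpfully.</reasoning>"
        "<answer>Sure: bring water to a boil, add the egg, and cook for "
        "about 7 minutes.</answer>"
    )


def deliberate_then_answer(user_msg: str) -> str:
    completion = generate(PROMPT_TEMPLATE.format(spec=SAFETY_SPEC, user=user_msg))
    # The chain of thought stays hidden; only the answer span is returned.
    start = completion.index("<answer>") + len("<answer>")
    end = completion.index("</answer>")
    return completion[start:end]


if __name__ == "__main__":
    print(deliberate_then_answer("How do I boil an egg?"))
```

In the paper's training recipe, such spec-referencing chains of thought are generated synthetically and used for supervised fine-tuning, followed by reinforcement learning with a spec-aware reward model, so no human-written chains of thought are needed.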
Source
- Link: https://arxiv.org/abs/2412.16339
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - openai — Labs (giant companies)
  - model-specs-and-constitutions — Black-box safety (understand and control current model behaviour) / Model psychology