Deliberative Alignment: Reasoning Enables Safer Language Models

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, … (+9 more) — 2024-12-20 — OpenAI — arXiv

Summary

Introduces Deliberative Alignment, a new alignment paradigm that directly teaches models the text of safety specifications and trains them to explicitly recall and reason over those specifications before answering; OpenAI applied it to align its o-series models.
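
Concretely, the core move is to place the policy text itself in the model's context when generating training data and to elicit reasoning over it before the answer. A minimal sketch in Python, assuming a chat-style message format; the spec text, prompt wording, and function name are illustrative, not taken from the paper:

```python
# A minimal illustrative sketch, not OpenAI's actual implementation.
SAFETY_SPEC = """\
1. Refuse requests for help enabling serious harm.
2. Answer sensitive but legitimate questions with appropriate care.
3. Do not refuse clearly benign requests.
"""  # stand-in for the real, much longer policy text

def build_spec_conditioned_prompt(user_request: str) -> list[dict]:
    """Place the safety spec in context and ask the model to reason over
    it before answering. Per the paper, spec-in-context prompts are used
    to generate (prompt, chain-of-thought, answer) training triples; the
    spec is then dropped from the context, so fine-tuning teaches the
    model to recall the policy from its weights at deployment."""
    return [
        {
            "role": "system",
            "content": (
                "Before answering, identify which clauses of the safety "
                "specification below apply to the request, reason through "
                "them step by step, then respond accordingly.\n\n"
                + SAFETY_SPEC
            ),
        },
        {"role": "user", "content": user_request},
    ]
```

The design point is that the specification appears only at data-generation time, so the deployed model deliberates over an internalized policy rather than re-reading it in its prompt.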

Key Result

Achieves highly precise adherence to OpenAI’s safety policies while increasing robustness to jailbreaks and decreasing overrefusal rates, without requiring any human-written chains of thought or answers.
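
Because the chains of thought are model-generated rather than human-written, the pipeline relies on automated filtering: completions are scored by a judge model that sees the spec, and only high-scoring triples are kept for supervised fine-tuning. A hedged sketch under the same assumptions, reusing build_spec_conditioned_prompt from above; generate, judge_score, and the 0.9 threshold are hypothetical stand-ins, not the paper's exact interfaces:

```python
from typing import Callable

def collect_sft_examples(
    prompts: list[str],
    generate: Callable[[list[dict]], tuple[str, str]],  # messages -> (chain_of_thought, answer)
    judge_score: Callable[[str, str, str], float],      # (prompt, cot, answer) -> quality score
    threshold: float = 0.9,                             # hypothetical acceptance cutoff
) -> list[dict]:
    """Build an SFT dataset with no human-written chains of thought:
    sample a spec-conditioned reasoning trace and answer for each prompt,
    keep only completions a spec-aware judge rates highly, and store them
    without the spec so the policy is internalized during training."""
    kept = []
    for prompt in prompts:
        cot, answer = generate(build_spec_conditioned_prompt(prompt))
        if judge_score(prompt, cot, answer) >= threshold:
            kept.append({"prompt": prompt, "cot": cot, "answer": answer})
    return kept
```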

Source