Deliberative Alignment: Reasoning Enables Safer Language Models
Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, … (+9 more) — 2024-12-20 — OpenAI — arXiv
Summary
Introduces Deliberative Alignment, an alignment paradigm that directly teaches models explicit safety specifications and trains them to reason over those specifications before answering, and applies it to OpenAI’s o-series models.
Key Result
Achieved highly precise adherence to OpenAI’s safety policies while increasing robustness to jailbreaks and decreasing overrefusal rates, without requiring human-written chain-of-thought examples.
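As a rough illustration of the inference-time pattern the paper describes (the model reasons over an in-context safety specification in its chain of thought, and only the final answer is surfaced), here is a minimal sketch. The spec text, the `<reasoning>`/`<answer>` delimiters, and the `generate` stub are hypothetical placeholders, not OpenAI's actual spec, format, or API.

```python
# Sketch of deliberative-alignment-style inference: the safety specification
# is placed in context, the model reasons over it in a (hidden) chain of
# thought, and only the final answer is returned to the user.
# SAFETY_SPEC, the tag format, and generate() are illustrative stand-ins.

SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious harm.
2. Answer benign requests helpfully; do not overrefuse.
"""

PROMPT_TEMPLATE = (
    "Safety specification:\n{spec}\n"
    "User: {user}\n"
    "First reason about which spec clauses apply inside "
    "<reasoning>...</reasoning>, then reply inside <answer>...</answer>."
)


def generate(prompt: str) -> str:
    # Stand-in for sampling from a reasoning model.
    return (
        "<reasoning>The request is benign; clause 2 applies, so I should "
        "answer helpfully.</reasoning>"
        "<answer>Sure: bring water to a boil, add the egg, and cook for "
        "about 7 minutes.</answer>"
    )


def deliberate_then_answer(user_msg: str) -> str:
    completion = generate(PROMPT_TEMPLATE.format(spec=SAFETY_SPEC, user=user_msg))
    # The chain of thought stays hidden; only the answer span is returned.
    start = completion.index("<answer>") + len("<answer>")
    end = completion.index("</answer>")
    return completion[start:end]


if __name__ == "__main__":
    print(deliberate_then_answer("How do I boil an egg?"))
```

In the paper's training recipe, such spec-referencing chains of thought are generated synthetically and used for supervised fine-tuning, followed by reinforcement learning with a spec-aware reward model, so no human-written chains of thought are needed.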
Source
- Link: https://arxiv.org/abs/2412.16339
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - openai — Labs (giant companies)
  - model-specs-and-constitutions — Black-box safety (understand and control current model behaviour) / Model psychology