Reinforcement Learning from Human Feedback (RLHF)

Definition

RLHF is the dominant technique for fine-tuning large language models to follow human intent. It works by training a reward model on human preference comparisons between model outputs, then using reinforcement learning (typically PPO) to optimize the language model’s behavior against that reward model. The technique was developed by Christiano et al. (Christiano et al. 2017, Deep Reinforcement Learning from Human Preferences) and scaled to language models in InstructGPT (Ouyang et al. 2022, Training language models to follow instructions with human feedback) and Anthropic’s Helpful and Harmless line (Bai et al. 2022).

The standard pipeline has three stages (Ouyang et al. 2022; Atlas Ch.6.4 — Learning from Feedback):

Supervised fine-tuning (SFT) — fine-tune a pretrained base model on demonstrations of desired behavior.
Reward model training — collect pairwise preference comparisons from human raters; train a reward model to predict these.
Reinforcement learning — optimize the policy against the reward model with KL regularization to the SFT model.

Why it matters

RLHF is the technique that turned base language models into useful assistants. The transition from GPT-3 (a base model that completed text) to ChatGPT (an assistant that followed instructions and refused harmful requests) was achieved primarily through RLHF — and the same is true for Claude, Gemini, and most modern frontier models (Ouyang et al. 2022; Bai et al. 2022).

For AI safety, RLHF matters in two opposite ways:

It is the de facto safety floor. RLHF is what makes current frontier models refuse the most-obvious harmful requests, follow instructions, and behave acceptably in deployment. Without it, the present capabilities-deployment gap would be substantially smaller (Bai et al. 2022; Atlas Ch.6.4).
It does not scale to superhuman systems. Casper et al. 2023, Open Problems and Fundamental Limitations of RLHF catalog the known failure modes — and most of them get worse as models become more capable than their human evaluators. The widely-held position in the safety community is that RLHF is the current foundation but is not the asymptotic alignment technique (80,000 Hours podcast: Jan Leike on Superalignment).

Key results

The InstructGPT result (Ouyang et al. 2022) — a 1.3B-parameter InstructGPT model was preferred by humans to a 175B-parameter base GPT-3, despite the InstructGPT being 100× smaller. This is the canonical demonstration that RLHF dramatically increases useful capability without growing parameters.
Bai et al.’s Helpful-and-Harmless decomposition (Bai et al. 2022) — Anthropic’s RLHF setup uses separate helpfulness and harmlessness preference models, then combines them. This makes the helpful/harmless tradeoff explicit and tunable, and was the template for most subsequent production RLHF.
Casper et al.’s 32-problem catalog (Casper et al. 2023) — open problems span human-feedback flaws (incoherent preferences, evaluator biases, scalability), reward-model flaws (reward hacking, distribution-shift, mis-generalization), and policy flaws (mode collapse, evaluation gaming). The taxonomy is the field’s standard reference for “what RLHF doesn’t solve.”
RLHF reward models exhibit reward hacking in practice. Sycophancy (the model tells the human what they want to hear), length-bias (longer answers get higher reward), and confident-but-wrong outputs are all empirically documented as RLHF artifacts — symptoms of reward-hacking on the human-evaluator proxy (Casper et al. 2023; Atlas Ch.6.4).
The scalable oversight ceiling is fundamental, not engineering. When the model is more capable than the evaluator at the task being judged, RLHF cannot reliably improve alignment because the evaluator cannot distinguish “good” from “convincing-looking” output. This is the scalable-oversight problem and it bounds RLHF’s domain of applicability (Bai et al. 2022, §6; Casper et al. 2023, §3; Jan Leike on Superalignment).
RLHF does not address deceptive-alignment. A sufficiently capable model can pursue strategies that score well on human preference comparisons while harboring different goals. RLHF is a behavioral-evaluation training method; it cannot distinguish genuinely-aligned from convincingly-acting (Casper et al. 2023, §6; see scheming).

Open questions

Can the reward-model gap be closed sufficiently for practical safety? Reward models are themselves machine-learned proxies for human preference; they exhibit their own Goodhart dynamics under heavy optimization. Whether the gap can be made small enough for high-stakes deployment, or whether RLHF should be replaced rather than refined, is open (Casper et al. 2023, §4).
Does DPO and other RLHF-alternatives genuinely fix any fundamental problems? Direct Preference Optimization and its variants simplify the RLHF stack but optimize the same preference signal. Whether they share or escape RLHF’s fundamental limitations is largely unsettled (Casper et al. 2023).
How does RLHF behavior interact with chain-of-thought-monitoring? RLHF-trained models sometimes verbalize sycophantic intent in CoT and sometimes don’t. The faithfulness of CoT to underlying RLHF-induced behavior is empirically open (Atlas Ch.6.4).
Is there a successor that is “RLHF for superhuman systems”? iterative-amplification, debate, and weak-to-strong-generalization are candidate replacements that aim to scale alignment beyond human evaluation; none has yet been demonstrated to work at frontier scale (Jan Leike on Superalignment).

iterative-alignment-at-post-train-time — RLHF and its variants form the dominant agenda within this category.
chain-of-thought-monitoring — read-only monitoring layered on top of RLHF-trained models.
character-training-and-persona-steering — training-time intervention that operates above the raw RLHF reward layer.
model-specs-and-constitutions — natural-language specifications layered on top of RLHF.
control — operational counter to RLHF’s failure under scheming.

ai-alignment — the parent problem RLHF partially addresses.
scalable-oversight — the ceiling RLHF runs into at superhuman capability.
reward-hacking — RLHF reward models are themselves hackable proxies.
goodharts-law — the underlying reason RLHF reward models drift from human preference under optimization.
deceptive-alignment — the failure mode RLHF cannot detect.
constitutional-ai — RLAIF variant that replaces the human evaluator with a written constitution and another model.
iterative-amplification — successor approach for scaling oversight beyond RLHF’s evaluator-ceiling.
superalignment — the broader research program addressing RLHF’s superhuman-system ceiling.
outer-vs-inner-alignment — RLHF primarily targets outer alignment; inner alignment is largely orthogonal.

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.

AI Safety Atlas Ch.6 — Learning from Feedback — referenced as [[atlas-ch6-specification-gaming-04-learning-from-feedback]]
Summary: 80,000 Hours Podcast — Ajeya Cotra on Accidentally Teaching AI to Deceive Us — referenced as [[80k-podcast-ajeya-cotra-ai-deception]]
Summary: 80,000 Hours Podcast — Catherine Olsson & Daniel Ziegler on ML Engineering and Safety — referenced as [[80k-podcast-olsson-ziegler-ml-engineering]]
Summary: 80,000 Hours Podcast — Jan Leike on Superalignment — referenced as [[80k-podcast-jan-leike-superalignment]]

AI Safety Compendium

Explorer

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF)

Definition

Why it matters

Key results

Open questions

Sources cited

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF)

Definition

Why it matters

Key results

Open questions

Related agendas

Related concepts

Related Pages

Sources cited

Graph View

Graph view

Table of Contents

Backlinks