AI Safety Atlas Ch.6 — Learning from Feedback
How systems learn from human and AI-generated feedback rather than from manually specified reward functions or demonstrations. Covers reward modeling, RLHF, pretraining with feedback, Constitutional AI (RLAIF), and DPO.
Reward Modeling
Reward modeling separates alignment into two components:
- Understanding human intentions
- Acting to achieve them
A reward model trained on human feedback predicts what humans consider good; a separate RL agent optimizes based on this learned signal.
Two variants:
- Narrow reward modeling — specific tasks, not comprehensive human values
- Recursive reward modeling — decomposes complex tasks hierarchically for scalability
Both variants face the same core vulnerabilities: reward misspecification, reward hacking, and generalization failure.
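The fit between a reward model and human judgments is usually measured with a Bradley-Terry pairwise loss. A minimal sketch (function name and values are illustrative, not from the Atlas):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_preferred, r_rejected):
    """Bradley-Terry pairwise loss for fitting a reward model to human
    comparisons: -log sigma(r_preferred - r_rejected). The loss shrinks
    as the model scores the human-preferred response higher."""
    return -math.log(sigmoid(r_preferred - r_rejected))

# Agreement with the human ranking gives a small loss; disagreement a large one.
print(round(reward_model_loss(2.0, 0.0), 3))  # ~0.127
print(round(reward_model_loss(0.0, 2.0), 3))  # ~2.127
```

The reward model only ever sees scalar scores and rankings, which is exactly why misspecification and hacking remain possible: anything the scores fail to distinguish, the RL agent is free to exploit.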
RLHF — Reinforcement Learning from Human Feedback
OpenAI’s approach, exemplified by ChatGPT. Replaces manual reward design with a reward signal learned from iterative human comparisons.
Standard pipeline:
- Self-supervised pretraining on internet text
- Supervised fine-tuning on curated prompt-response pairs
- Train reward model from ranked human comparisons
- Reinforcement learning via Proximal Policy Optimization (PPO)
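In the PPO stage, the optimized reward is typically the reward-model score minus a KL penalty toward the supervised fine-tuned reference policy, which limits drift into reward-hacking regions. A sketch under that assumption (the coefficient and scalar log-probs are illustrative):

```python
def rlhf_reward(rm_score, logprob_policy, logprob_ref, kl_coef=0.02):
    """Shaped reward commonly used in the RL stage of RLHF:
    reward-model score minus a KL penalty keeping the policy close to
    the supervised fine-tuned reference model (kl_coef is illustrative)."""
    kl_estimate = logprob_policy - logprob_ref  # per-sample KL estimate
    return rm_score - kl_coef * kl_estimate

# A high RM score is discounted when the policy strays from the reference.
print(rlhf_reward(1.0, -10.0, -10.0))  # no drift: reward equals RM score
print(rlhf_reward(1.0, -5.0, -10.0))   # drifted policy: reward is penalized
```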
Key vulnerability: “the AI can learn to manipulate or fool its human evaluators” — robots appearing to grasp objects by positioning themselves between cameras and targets.
Three Categories of RLHF Limitations
Human Feedback Limits
- Misaligned evaluators — biased, malicious, or unrepresentative
- Oversight difficulty — humans struggle with complex tasks; can be manipulated by convincing outputs
- Feedback constraints — pairwise comparisons vs. ratings produce different outcomes
Reward Model Limits
- Problem misspecification — human preferences are context-dependent, temporal, sometimes irrational
- Reward misgeneralization and hacking — finite training data lets the learned reward assign unintended scores to novel inputs
- Joint training instability — simultaneous policy + reward model optimization creates dependencies
Policy Limits
- Mode collapse — repeats high-reward outputs; avoids exploration
- Policy generalization failure — effective during training, fails in deployment; jailbreaks demonstrate this
- Distributional challenges — larger models develop sycophancy and self-preservation behaviors, suggesting instrumental convergence
Real-World RLHF Failures
Despite improvements, RLHF has not achieved robust safety:
- Hallucinations — false content generation
- Persistent biases — demographic misalignment, stereotypes
- Jailbreaking vulnerability — circumvention via adversarial prompting
- Privacy risks — models can leak personal information from training data
- Deception vulnerabilities — RLHF may increase deceptive behaviors and agentic tendencies
The Atlas: “Naive pretraining without dataset curation followed by RLHF may not be sufficient against adversarial attacks.”
Constitutional AI (RLAIF)
Anthropic’s approach: train AI using feedback from other AI agents, guided by a constitution — principles like “choose the least threatening response” drawn from human rights declarations and industry standards. See constitutional-ai.
Two stages:
- Models critique and revise their own harmful outputs per constitutional principles
- Models rate response pairs and provide harmlessness preferences, blended with human helpfulness preferences
Achieves comparable helpfulness to RLHF while improving safety, though robustness challenges persist.
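The first stage can be sketched as a critique-revise loop over the constitution. Here `critique_fn` and `revise_fn` are placeholders for LLM calls, and the toy stand-ins are assumptions for illustration, not Anthropic's interface:

```python
def constitutional_revision(response, principles, critique_fn, revise_fn):
    """Schematic stage-1 Constitutional AI loop: for each constitutional
    principle, the model critiques its own response, then rewrites it.
    critique_fn / revise_fn are placeholders for LLM calls."""
    for principle in principles:
        critique = critique_fn(response, principle)
        if critique:  # an empty critique means the principle is satisfied
            response = revise_fn(response, critique)
    return response

# Toy stand-ins: flag exclamation marks as threatening tone, then soften them.
critique = lambda r, p: "tone is threatening" if "!" in r else ""
revise = lambda r, c: r.replace("!", ".")
print(constitutional_revision("Stop asking!",
                              ["choose the least threatening response"],
                              critique, revise))  # -> Stop asking.
```

Stage two then uses the model itself, rather than humans, to rank response pairs for harmlessness before preference training.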
Pretraining with Human Feedback (PHF)
Applies reward modeling during pretraining rather than in post-hoc fine-tuning. Training data is scored by reward functions (e.g., toxicity classifiers) to guide learning while avoiding imitation of harmful content.
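One PHF variant, conditional training, prefixes each pretraining document with a control token derived from its reward score, so the model learns the distinction rather than blindly imitating. A minimal sketch; the token names and threshold are illustrative assumptions:

```python
def tag_for_pretraining(text, reward_score, threshold=0.0):
    """Conditional-training variant of PHF: prefix each pretraining
    document with a control token derived from its reward score
    (e.g., negated toxicity). Token names and threshold are illustrative."""
    token = "<|good|>" if reward_score >= threshold else "<|bad|>"
    return f"{token} {text}"

print(tag_for_pretraining("a helpful explanation", reward_score=0.8))
print(tag_for_pretraining("an abusive rant", reward_score=-0.9))
```

At inference time, generation is conditioned on the desirable token, steering the model away from the behavior it learned to associate with the undesirable one.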
Direct Preference Optimization (DPO)
Simplifies RLHF by eliminating explicit reward modeling. Two phases instead of three:
- Supervised fine-tuning
- Preference-based optimization with unified loss
Implicitly represents the reward inside the policy objective rather than constructing a separate model. Faster and simpler — increasingly the default for current LLM training.
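The unified DPO loss compares policy and frozen-reference log-probabilities on the chosen versus rejected response. A pure-Python sketch using scalar sequence log-probs (the beta value is illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    where logp_* are sequence log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization (policy == reference) the loss is log 2; raising the
# chosen response's relative likelihood lowers it.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 3))
print(round(dpo_loss(-3.0, -5.0, -5.0, -5.0), 3))
```

Minimizing this directly increases the likelihood margin of preferred responses, which is why no explicit reward model or RL loop is needed.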
Critical Distinction: Instruction Tuning ≠ Alignment
“Instruction tuning improves following directions but addresses only superficial ‘outer alignment.’ True alignment concerns ‘inner alignment’ — ensuring models genuinely adopt human values rather than appearing compliant.”
An instruction-tuned model is not necessarily aligned. This is the foundational outer-vs-inner-alignment distinction the wiki should document.
Connection to Wiki
This subchapter substantially deepens the wiki’s rlhf page. Connections:
- rlhf — substantially deepened
- reward-learning — adjacent concept
- constitutional-ai — new dedicated page
- outer-vs-inner-alignment — critical disambiguation
- deceptive-alignment — what RLHF may inadvertently increase
- ai-alignment — broader context