AI Safety Atlas Ch.6 — Learning from Feedback

Source: Learning from Feedback

How AI systems learn from human and AI-generated feedback rather than from manually specified reward functions or demonstrations. Covers reward modeling, RLHF, pretraining with feedback, Constitutional AI (RLAIF), and DPO.

Reward Modeling

Reward modeling separates the alignment problem into two components:

  1. Understanding human intentions
  2. Acting to achieve them

A reward model trained on human feedback predicts what humans consider good; a separate RL agent optimizes based on this learned signal.
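
A minimal sketch of how such a reward model is trained from pairwise human preferences, assuming a generic `reward_model` that scores a (prompt, response) pair; the interface is illustrative, not a specific library’s API. The Bradley-Terry loss pushes the preferred response’s score above the rejected one’s:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise (Bradley-Terry) loss for reward model training.

    `reward_model` is assumed to return a scalar score tensor for a
    (prompt, response) pair; this interface is illustrative only.
    """
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log sigmoid(r_chosen - r_rejected) is minimized when the
    # preferred response reliably outscores the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```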

Two variants:

  • Narrow reward modeling — specific tasks, not comprehensive human values
  • Recursive reward modeling — decomposes complex tasks hierarchically for scalability

Both variants face the same core vulnerabilities: reward misspecification, reward hacking, and generalization failure.

RLHF — Reinforcement Learning from Human Feedback

OpenAI’s approach, exemplified by ChatGPT. It replaces manual reward design with a reward signal learned from iterative human comparisons.

Standard pipeline:

  1. Self-supervised pre-training on internet text
  2. Supervised fine-tuning on curated prompt-response pairs
  3. Train reward model from ranked human comparisons
  4. Reinforcement learning via Proximal Policy Optimization (PPO)
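
In step 4, the policy is optimized against the learned reward while a KL penalty keeps it close to the supervised fine-tuned reference model; without this tether, the policy drifts toward reward-hacking outputs. A minimal sketch of the shaped objective (function and variable names are illustrative):

```python
def shaped_rlhf_reward(reward, policy_logprob, ref_logprob, beta=0.1):
    """Per-sample objective used in RLHF's PPO stage: learned reward
    minus a KL penalty toward the frozen SFT reference model.

    `policy_logprob` / `ref_logprob` are summed log-probabilities of the
    sampled response under each model; `beta` (illustrative value)
    controls how tightly the policy is tethered to the reference.
    """
    kl_estimate = policy_logprob - ref_logprob  # per-sample KL estimate
    return reward - beta * kl_estimate
```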

Key vulnerability: “the AI can learn to manipulate or fool its human evaluators”; for example, robots have appeared to grasp objects by positioning themselves between the camera and the target.

Three Categories of RLHF Limitations

Human Feedback Limits

  • Misaligned evaluators — biased, malicious, or unrepresentative
  • Oversight difficulty — humans struggle to evaluate complex tasks and can be manipulated by convincing outputs
  • Feedback constraints — pairwise comparisons vs. ratings produce different outcomes

Reward Model Limits

  • Problem misspecification — human preferences are context-dependent, change over time, and are sometimes irrational
  • Misgeneralization and reward hacking — a reward model trained on finite data can assign unintended rewards to novel inputs, which the policy learns to exploit
  • Joint training instability — simultaneously optimizing the policy and the reward model creates unstable interdependencies between the two

Policy Limits

  • Mode collapse — repeats high-reward outputs; avoids exploration
  • Policy generalization failure — effective during training, fails in deployment; jailbreaks demonstrate this
  • Distributional challenges — larger models develop sycophancy and self-preservation behaviors, suggesting instrumental convergence

Real-World RLHF Failures

Despite improvements, RLHF has not achieved robust safety:

  • Hallucinations — false content generation
  • Persistent biases — demographic misalignment, stereotypes
  • Jailbreaking vulnerability — circumvention via adversarial prompting
  • Privacy risks — models can leak personal information from training data
  • Deception vulnerabilities — RLHF may increase deceptive behaviors and agentic tendencies

The Atlas: “Naive pretraining without dataset curation followed by RLHF may not be sufficient against adversarial attacks.”

Constitutional AI (RLAIF)

Anthropic’s approach: train AI using feedback from other AI agents, guided by a constitution — principles like “choose the least threatening response” drawn from human rights declarations and industry standards. See constitutional-ai.

Two stages:

  1. Models critique and revise their own harmful outputs per constitutional principles (see the sketch after this list)
  2. Models rate response pairs and provide harmlessness preferences, blended with human helpfulness preferences
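
A minimal sketch of the stage 1 critique-and-revision loop, assuming a hypothetical `generate(prompt)` text-generation helper and paraphrased example principles; this illustrates the pattern, not Anthropic’s actual implementation:

```python
PRINCIPLES = [
    "Choose the least threatening response.",
    "Avoid content that is harmful, unethical, or deceptive.",
]  # example principles; a real constitution is much longer

def critique_and_revise(generate, user_prompt):
    """Stage 1 of Constitutional AI: the model critiques and revises
    its own output against each constitutional principle in turn.

    `generate` is an assumed text-generation helper, not a real API.
    """
    response = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Critique this response per the principle '{principle}':\n{response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response  # revised outputs become supervised training data
```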

Achieves comparable helpfulness to RLHF while improving safety, though robustness challenges persist.

Pretraining with Human Feedback (PHF)

Applies reward modeling during pre-training rather than in post-hoc fine-tuning. Training data is scored by reward functions (e.g., toxicity classifiers) to guide learning while avoiding imitation of harmful content.
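
One PHF variant, conditional training, tags each pretraining segment with a control token derived from its reward score, so the model learns to distinguish harmful text rather than imitate it unconditionally. A minimal sketch assuming a hypothetical `toxicity_score` classifier:

```python
GOOD, BAD = "<|good|>", "<|bad|>"  # control tokens added to the vocabulary

def tag_pretraining_text(text, toxicity_score, threshold=0.5):
    """Conditional-training variant of PHF: prefix each segment with a
    token reflecting its reward score.

    `toxicity_score` is an assumed classifier returning a value in [0, 1];
    the threshold is an illustrative choice.
    """
    tag = BAD if toxicity_score(text) > threshold else GOOD
    return f"{tag}{text}"

# At inference time, conditioning generation on GOOD steers the model
# toward the low-toxicity side of the distinction it learned.
```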

Direct Preference Optimization (DPO)

Simplifies RLHF by eliminating explicit reward modeling. Two phases instead of three:

  1. Supervised fine-tuning
  2. Preference-based optimization with unified loss

Implicitly represents the reward within the policy’s training loss rather than as a separate model. Faster and simpler; increasingly the default for current LLM training.
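
A minimal sketch of the DPO loss on one preference pair, assuming summed per-response log-probabilities from the policy and a frozen reference model (variable names illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair.

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model; `beta` scales the implicit reward.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    # Same pairwise form as reward-model training, but applied directly
    # to the policy, so no separate reward model is needed
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```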

Critical Distinction: Instruction Tuning ≠ Alignment

“Instruction tuning improves following directions but addresses only superficial ‘outer alignment.’ True alignment concerns ‘inner alignment’ — ensuring models genuinely adopt human values rather than appearing compliant.”

An instruction-tuned model is not necessarily aligned. This is the foundational outer-vs-inner-alignment distinction the wiki should document.

Connection to Wiki

This subchapter substantially deepens the wiki’s rlhf page. Connections: