Reward Learning
Reward learning is the approach of training AI systems by learning a reward function from human feedback, rather than hand-coding it. Instead of specifying what the AI should optimize in advance, the system learns what humans value by observing their judgments.
The Core Problem It Solves
Hand-coding a reward function is extremely difficult for complex, open-ended tasks. Writing a rule that captures “be helpful and harmless” in all situations is practically impossible. Reward learning sidesteps this: human evaluators rate outputs, and those ratings are used to infer an implicit reward function.
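One standard way to make that inference concrete, assuming the pairwise-comparison setup described under RLHF below, is the Bradley-Terry preference model:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

Here x is the prompt, y_w and y_l are the preferred and rejected outputs, r_θ is the learned reward function, and σ is the logistic sigmoid; fitting θ to observed human comparisons recovers an implicit reward.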
catherine-olsson describes her early work at OpenAI as “getting a whole bunch of human feedback and using that to train AI systems to do the right thing”, a plain-language summary of the approach.
Relationship to RLHF
Reward learning is the conceptual foundation of rlhf (Reinforcement Learning from Human Feedback), the dominant technique for aligning large language models. In RLHF:
1. Human raters compare pairs of model outputs.
2. A reward model is trained to predict which output humans prefer.
3. The main model is fine-tuned via reinforcement learning to maximize the learned reward.
Reward learning refers to step 2 — learning the reward model from human feedback.
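Step 2 can be sketched in a few lines of PyTorch. This is a minimal illustration, not any particular production system: the RewardModel class, the feature dimensions, and the random placeholder data are all assumptions, and the loss is the Bradley-Terry objective above.

```python
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    """Hypothetical scalar reward model: a linear head standing in
    for a full language-model backbone over pooled features of a
    (prompt, response) pair."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score_head = torch.nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_dim) -> one scalar reward per example
        return self.score_head(features).squeeze(-1)

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry loss: push the human-preferred output's score
    # above the rejected output's score.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Placeholder encodings for a batch of 8 preference pairs.
chosen_feats = torch.randn(8, 768)
rejected_feats = torch.randn(8, 768)

optimizer.zero_grad()
loss = preference_loss(model(chosen_feats), model(rejected_feats))
loss.backward()
optimizer.step()
```

In practice the backbone is typically the pretrained language model itself with its output layer replaced by a scalar head, but the loss is the same.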
Limitations
- Reward hacking: The model may find outputs that score highly on the learned reward but don’t reflect genuine human preferences.
- Distributional shift: The learned reward model may generalize poorly to novel outputs outside the training distribution (a common partial mitigation is sketched after this list).
- Scalable oversight: Human raters can only evaluate outputs they can understand — a problem as AI becomes more capable (see scalable-oversight).
- Deceptive alignment: A sufficiently capable model might learn to produce outputs that look good to human raters without actually internalizing the intended values (see deceptive-alignment).
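A common partial mitigation for the first two limitations, used throughout the RLHF literature, is to penalize the fine-tuned policy for drifting away from a reference model. A sketch of the KL-regularized objective, with β and π_ref as standard notation rather than details from this page's sources:

$$\max_{\pi}\ \mathbb{E}_{x,\, y \sim \pi}\!\left[ r_\theta(x, y) \right] - \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)$$

The KL term keeps generations close to the distribution the reward model was trained on, limiting how far the policy can exploit regions where the learned reward is unreliable.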
Related Pages
- rlhf
- ai-alignment
- scalable-oversight
- deceptive-alignment
- robustness
- catherine-olsson
- daniel-ziegler
- 80k-podcast-olsson-ziegler-ml-engineering
- torben-swoboda
- specification-gaming
- goodharts-law
- reward-hacking
- reward-tampering
- constitutional-ai
- inverse-reinforcement-learning
- outer-vs-inner-alignment
- atlas-ch6-specification-gaming-01-reinforcement-learning
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.6 — Reinforcement Learning — referenced as [[atlas-ch6-specification-gaming-01-reinforcement-learning]]
- 80,000 Hours Podcast — Catherine Olsson & Daniel Ziegler on ML Engineering and Safety — referenced as [[80k-podcast-olsson-ziegler-ml-engineering]]