AI Safety Atlas Ch.6 — Reinforcement Learning
Foundational RL concepts establishing the technical vocabulary for the rest of Ch.6: “Reinforcement Learning focuses on developing agents that can learn from interactive experiences,” with rewards as the feedback signal that drives learning.
The Core RL Loop
Three steps repeat until a terminal state is reached (a minimal sketch follows the list):
- Agent takes action
- Environment state changes
- Environment emits an observation + reward signal
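A minimal sketch of this loop in Python. The toy corridor environment, the random policy, and the reward values are invented for illustration; they are not from the Atlas.

```python
import random

class ToyEnv:
    """Toy corridor: the agent starts at position 0; the episode ends at 5."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Environment state changes in response to the action (-1 or +1).
        self.state = max(0, self.state + action)
        done = self.state >= 5
        # Environment emits an observation and a reward signal.
        reward = 1.0 if done else -0.1  # per-step cost, bonus at the goal
        return self.state, reward, done

def random_policy(state):
    # Agent chooses an action; here, uniformly at random.
    return random.choice([-1, 1])

env = ToyEnv()
state, done, ret = 0, False, 0.0
while not done:                              # repeat until terminal
    action = random_policy(state)            # 1. agent takes action
    state, reward, done = env.step(action)   # 2. state changes  3. obs + reward
    ret += reward
print(f"episode return: {ret:.1f}")
```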
Three Foundational Objects
Policies
Functions mapping states to actions. Two types:
- Deterministic — each state → one specific action
- Stochastic — each state → probability distribution over actions
Optimal policies maximize expected cumulative rewards over time.
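Both kinds can be written as plain functions over states; a sketch (the states, actions, and probabilities here are made up for illustration):

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    # Deterministic: each state maps to exactly one action.
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    # Stochastic: each state maps to a probability distribution over
    # actions; acting means sampling from that distribution.
    p_right = 0.8 if state >= 0 else 0.2
    return random.choices(ACTIONS, weights=[1 - p_right, p_right])[0]

print(deterministic_policy(3))   # always "right"
print(stochastic_policy(3))      # "right" with probability 0.8
```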
Reward Functions
Map environmental states or state-action pairs to numerical values, providing immediate feedback.
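As a sketch, a reward function over state-action transitions can be a single function; the “goal” state and the numeric values here are invented:

```python
def reward(state, action, next_state):
    # Immediate feedback for one transition: a goal bonus, otherwise a step cost.
    return 1.0 if next_state == "goal" else -0.1
```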
Value Functions
Represent long-term desirability — distinct from rewards which represent immediate desirability.
“The reward indicates the immediate desirability of states or actions, while a value function represents the long-term desirability.”
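In standard notation (not quoted from the Atlas; γ ∈ [0, 1) is the discount factor), the value of state s under policy π is the expected discounted sum of future rewards:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right]
```

With γ near 1 the agent weighs distant rewards heavily; with γ near 0 the value collapses back toward the immediate reward.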
Important Disambiguations
- Rewards vs. utility functions — utility = subjective preferences in decision theory; rewards = immediate learning signals
- Reward hacking ≠ reward tampering — hacking exploits flaws in the specified reward function, while tampering corrupts the reward signal or process itself; see reward-hacking and reward-tampering
Real-World Applications
The Atlas catalogs:
- Robotic control — movement learning
- Recommendation systems — engagement optimization
- Game-playing — AlphaGo, AlphaStar, MuZero
The Safety Warning
The section ends with the chapter’s foundational concern: “Misaligned reward functions create dangerous incentives: agents may manipulate metrics rather than accomplish intended objectives.”
This sets up Goodhart’s Law (next subchapter) and the broader specification-gaming failure mode that defines the chapter.
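A toy illustration of that metric-gaming gap, entirely invented for this summary: the intended objective is user satisfaction, but the specified reward is a proxy (engagement clicks).

```python
def proxy_reward(episode):
    return episode["clicks"]          # what the agent is trained to maximize

def intended_objective(episode):
    return episode["satisfaction"]    # what the designer actually wants

clickbait = {"clicks": 120, "satisfaction": 0.2}  # gamed metric
helpful = {"clicks": 40, "satisfaction": 0.9}

# The agent prefers the high-proxy episode even though it scores
# worse on the intended objective.
best = max([clickbait, helpful], key=proxy_reward)
assert best is clickbait
assert intended_objective(best) < intended_objective(helpful)
```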
Connection to Wiki
- reward-learning — adjacent concept
- rlhf — RL applied with human feedback
- ai-alignment — the broader concern
- goodharts-law — what comes next