AI Safety Atlas Ch.6 — Reinforcement Learning

Source: Reinforcement Learning

Foundational RL concepts establishing the technical vocabulary for the rest of Ch.6. “Reinforcement Learning focuses on developing agents that can learn from interactive experiences” through rewards.

The Core RL Loop

Three steps that repeat until a terminal state is reached:

  1. Agent takes action
  2. Environment state changes
  3. System outputs observations + reward signal
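The three-step loop can be sketched as a minimal agent-environment interaction. The `LineWorld` environment below is a hypothetical toy (not from the Atlas): the agent walks left or right on positions 0–4, and reaching position 4 is terminal and rewarded.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..4, position 4 is terminal."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # 2. Environment state changes in response to the agent's action.
        self.state = max(0, min(4, self.state + action))
        # 3. System outputs an observation plus a reward signal.
        observation = self.state
        reward = 1.0 if self.state == 4 else 0.0
        terminal = self.state == 4
        return observation, reward, terminal

env = LineWorld()
terminal = False
total_reward = 0.0
while not terminal:
    action = random.choice([-1, +1])          # 1. Agent takes an action
    obs, reward, terminal = env.step(action)  # 2–3. State change, obs + reward
    total_reward += reward
```

Here the agent acts randomly; learning algorithms differ only in how they choose `action` from past observations and rewards.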

Three Foundational Objects

Policies

Functions mapping states to actions. Two types:

  • Deterministic — each state → one specific action
  • Stochastic — each state → probability distribution over actions

Optimal policies maximize expected cumulative rewards over time.
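A minimal sketch of the two policy types, using hypothetical states and actions (a battery-aware robot, purely illustrative):

```python
import random

# Deterministic policy: each state maps to exactly one action.
det_policy = {"low_battery": "recharge", "ok": "explore"}

def act_deterministic(state):
    return det_policy[state]

# Stochastic policy: each state maps to a probability distribution over actions.
stoch_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "ok":          {"recharge": 0.1, "explore": 0.9},
}

def act_stochastic(state):
    actions, probs = zip(*stoch_policy[state].items())
    return random.choices(actions, weights=probs)[0]
```

A deterministic policy is a special case of a stochastic one in which all probability mass sits on a single action.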

Reward Functions

Map environmental states or state-action pairs to numerical values, providing immediate feedback.
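As a sketch, a reward function over state-action pairs is just a mapping to numbers; the cleaning-task names below are illustrative, not from the Atlas:

```python
# Hypothetical reward function for a cleaning task: immediate feedback only.
def reward(state, action):
    if state == "dirty" and action == "clean":
        return 1.0    # positive feedback for useful work
    if action == "bump_wall":
        return -0.5   # immediate penalty
    return 0.0        # neutral otherwise
```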

Value Functions

Represent long-term desirability, as distinct from rewards, which represent immediate desirability.

“The reward indicates the immediate desirability of states or actions, while a value function represents the long-term desirability.”
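The distinction can be made concrete with a hypothetical three-state chain (illustrative, not from the Atlas): states 0 and 1 carry zero immediate reward, state 2 pays +10, and the value of the earlier states is the discounted sum of the rewards ahead of them.

```python
# Immediate desirability: reward per state.
rewards = {0: 0.0, 1: 0.0, 2: 10.0}
gamma = 0.9   # discount factor

# Long-term desirability: value under the deterministic "move right" policy,
# i.e., V(s) = r(s) + gamma * V(s + 1), with state 2 terminal.
def value(state):
    if state == 2:
        return rewards[2]
    return rewards[state] + gamma * value(state + 1)

print(value(0))  # 8.1 — zero immediate reward, but high long-term value
```

State 0 would look worthless to an agent reading only rewards, yet its value is high because it leads toward the rewarding state.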

Important Disambiguations

  • Rewards vs. utility functions — utility = subjective preferences in decision theory; rewards = immediate learning signals
  • Reward hacking vs. reward tampering — see reward-hacking and reward-tampering

Real-World Applications

The Atlas catalogs:

  • Robotic control — movement learning
  • Recommendation systems — engagement optimization
  • Game-playing — AlphaGo, AlphaStar, MuZero

The Safety Warning

The section ends with the chapter’s foundational concern: “Misaligned reward functions create dangerous incentives: agents may manipulate metrics rather than accomplish intended objectives.”
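A toy sketch of this concern, with entirely hypothetical names: suppose the proxy reward counts "clean" actions rather than actual cleanliness. A metric-manipulating trajectory that makes messes in order to re-clean them outscores the honest one while leaving the floor no cleaner.

```python
# Misaligned proxy: rewards the act of cleaning, not the state of cleanliness.
def proxy_reward(action):
    return 1.0 if action == "clean" else 0.0

gaming_trajectory = ["make_mess", "clean"] * 3   # floor ends up no cleaner
honest_trajectory = ["clean"]                    # floor actually clean

gaming_score = sum(proxy_reward(a) for a in gaming_trajectory)  # 3.0
honest_score = sum(proxy_reward(a) for a in honest_trajectory)  # 1.0
```

Under this proxy the gaming agent is rated three times better than the honest one, even though it accomplishes nothing: the metric is maximized, the intended objective is not.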

This sets up Goodhart’s Law (next subchapter) and the broader specification-gaming failure mode that defines the chapter.

Connection to Wiki