AI Safety Atlas Ch.6 — Reinforcement Learning

Source: Reinforcement Learning

Foundational RL concepts establishing the technical vocabulary for the rest of Ch.6. “Reinforcement Learning focuses on developing agents that can learn from interactive experiences” through rewards.

The Core RL Loop

Three steps that repeat until a terminal state is reached:

  1. Agent takes action
  2. Environment state changes
  3. System outputs observations + reward signal
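The three-step loop can be sketched as a minimal agent-environment interaction. The `LineWorld` environment below is a hypothetical toy (not from the Atlas): the agent walks left or right on positions 0–4, and reaching position 4 is terminal and rewarded.

```python
import random

class LineWorld:
    """Hypothetical toy environment: positions 0..4, position 4 is terminal."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # 2. Environment state changes in response to the agent's action.
        self.state = max(0, min(4, self.state + action))
        # 3. System outputs an observation plus a reward signal.
        observation = self.state
        reward = 1.0 if self.state == 4 else 0.0
        terminal = self.state == 4
        return observation, reward, terminal

env = LineWorld()
terminal = False
total_reward = 0.0
while not terminal:
    action = random.choice([-1, +1])          # 1. Agent takes an action
    obs, reward, terminal = env.step(action)  # 2–3. State change, obs + reward
    total_reward += reward
```

Here the agent acts randomly; learning algorithms differ only in how they choose `action` from past observations and rewards.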

Three Foundational Objects

Policies

Functions mapping states to actions. Two types:

  • Deterministic — each state → one specific action
  • Stochastic — each state → probability distribution over actions

Optimal policies maximize expected cumulative rewards over time.
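A minimal sketch of the two policy types, using hypothetical states and actions (a battery-aware robot, purely illustrative):

```python
import random

# Deterministic policy: each state maps to exactly one action.
det_policy = {"low_battery": "recharge", "ok": "explore"}

def act_deterministic(state):
    return det_policy[state]

# Stochastic policy: each state maps to a probability distribution over actions.
stoch_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "ok":          {"recharge": 0.1, "explore": 0.9},
}

def act_stochastic(state):
    actions, probs = zip(*stoch_policy[state].items())
    return random.choices(actions, weights=probs)[0]
```

A deterministic policy is a special case of a stochastic one in which all probability mass sits on a single action.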

Reward Functions

Map environmental states or state-action pairs to numerical values, providing immediate feedback.
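As a sketch, a reward function over state-action pairs is just a mapping to numbers; the cleaning-task names below are illustrative, not from the Atlas:

```python
# Hypothetical reward function for a cleaning task: immediate feedback only.
def reward(state, action):
    if state == "dirty" and action == "clean":
        return 1.0    # positive feedback for useful work
    if action == "bump_wall":
        return -0.5   # immediate penalty
    return 0.0        # neutral otherwise
```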

Value Functions

Represent long-term desirability, as distinct from rewards, which represent immediate desirability.

“The reward indicates the immediate desirability of states or actions, while a value function represents the long-term desirability.”
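The distinction can be made concrete with a hypothetical three-state chain (illustrative, not from the Atlas): states 0 and 1 carry zero immediate reward, state 2 pays +10, and the value of the earlier states is the discounted sum of the rewards ahead of them.

```python
# Immediate desirability: reward per state.
rewards = {0: 0.0, 1: 0.0, 2: 10.0}
gamma = 0.9   # discount factor

# Long-term desirability: value under the deterministic "move right" policy,
# i.e., V(s) = r(s) + gamma * V(s + 1), with state 2 terminal.
def value(state):
    if state == 2:
        return rewards[2]
    return rewards[state] + gamma * value(state + 1)

print(value(0))  # 8.1 — zero immediate reward, but high long-term value
```

State 0 would look worthless to an agent reading only rewards, yet its value is high because it leads toward the rewarding state.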

Important Disambiguations

  • Rewards vs. utility functions — utility = subjective preferences in decision theory; rewards = immediate learning signals
  • Reward hacking vs. reward tampering — see reward-hacking and reward-tampering

Real-World Applications

The Atlas catalogs:

  • Robotic control — movement learning
  • Recommendation systems — engagement optimization
  • Game-playing — AlphaGo, AlphaStar, MuZero

The Safety Warning

The section ends with the chapter’s foundational concern: “Misaligned reward functions create dangerous incentives: agents may manipulate metrics rather than accomplish intended objectives.”
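A toy sketch of this concern, with entirely hypothetical names: suppose the proxy reward counts "clean" actions rather than actual cleanliness. A metric-manipulating trajectory that makes messes in order to re-clean them outscores the honest one while leaving the floor no cleaner.

```python
# Misaligned proxy: rewards the act of cleaning, not the state of cleanliness.
def proxy_reward(action):
    return 1.0 if action == "clean" else 0.0

gaming_trajectory = ["make_mess", "clean"] * 3   # floor ends up no cleaner
honest_trajectory = ["clean"]                    # floor actually clean

gaming_score = sum(proxy_reward(a) for a in gaming_trajectory)  # 3.0
honest_score = sum(proxy_reward(a) for a in honest_trajectory)  # 1.0
```

Under this proxy the gaming agent is rated three times better than the honest one, even though it accomplishes nothing: the metric is maximized, the intended objective is not.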

This sets up Goodhart’s Law (next subchapter) and the broader specification-gaming failure mode that defines the chapter.

Connection to Wiki