RL safety — SR2025 Agenda Snapshot

One-sentence summary: Aims to improve the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.

Theory of Change

Standard RL objectives (such as maximizing expected return) are brittle and can lead to goal misgeneralization or specification gaming; by developing more robust frameworks (such as pessimistic RL, minimax regret, or provable inverse reward learning), we can build agents that remain safe even when their objectives or learned rewards are misspecified.
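
As an illustration of that contrast, the toy sketch below (invented numbers and policy names, not code or results from the agenda's papers) shows how an expected-value maximizer, a pessimistic (maximin) objective, and a minimax-regret objective can select different policies when the learned reward function is uncertain:

```python
# Toy sketch: policy selection under reward uncertainty.
# Rows are candidate policies (A, B, C); columns are plausible reward
# hypotheses the agent cannot distinguish from its training data
# (a stand-in for reward-learning error). All values are illustrative.
import numpy as np

returns = np.array([
    [10.0, 10.0, -50.0],   # policy A: high return on two hypotheses, disastrous on one
    [ 6.0,  6.0,   6.0],   # policy B: mediocre but safe under every hypothesis
    [ 9.0,  4.0,   5.0],   # policy C: in between
])
prior = np.array([0.475, 0.475, 0.05])  # agent's belief over reward hypotheses

expected_value = returns @ prior                  # standard RL: maximize E[return]
pessimistic    = returns.min(axis=1)              # maximin: maximize worst-case return
regret         = returns.max(axis=0) - returns    # shortfall vs best policy per hypothesis
minimax_regret = regret.max(axis=1)               # minimize the worst-case shortfall

print("expected-value picks: policy", "ABC"[int(expected_value.argmax())])   # A
print("pessimistic picks:    policy", "ABC"[int(pessimistic.argmax())])      # B
print("minimax-regret picks: policy", "ABC"[int(minimax_regret.argmin())])   # B
```

In this example the expected-value objective selects the policy with a catastrophic outcome under one plausible reward hypothesis, while both robust objectives prefer the policy whose worst case is bounded; that is the kind of brittleness the agenda aims to remove.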

Broad Approach

engineering

Target Case

pessimistic

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Value is fragile and hard to specify, Superintelligence can fool human supervisors

Key People

Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen

Funding

Google DeepMind, University of Oxford, CMU, Coefficient Giving

Estimated FTEs: 20-70

Critiques

“The Era of Experience” has an unsolved technical alignment problem, The Invisible Leash: Why RLVR May or May Not Escape Its Origin

See Also

behavior-alignment-theory, assistance-games-assistive-agents, Goal robustness, Iterative alignment, mild-optimisation, scalable oversight, The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Outputs in 2025

11 items in the review. See the wiki/summaries/ entries with frontmatter agenda: rl-safety (generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; update by re-running the script rather than editing by hand.