RL safety — SR2025 Agenda Snapshot

One-sentence summary: Aims to improve the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.

Theory of Change

Standard RL objectives (such as maximizing expected return) are brittle and can lead to goal misgeneralization or specification gaming; by developing more robust frameworks (such as pessimistic RL, minimax regret, or provable inverse reward learning), we can build agents that remain safe even when their objectives or learned rewards are misspecified.
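
As an illustration of that contrast, the toy sketch below (invented numbers and policy names, not code or results from the agenda's papers) shows how an expected-value maximizer, a pessimistic (maximin) objective, and a minimax-regret objective can select different policies when the learned reward function is uncertain:

```python
# Toy sketch: policy selection under reward uncertainty.
# Rows are candidate policies (A, B, C); columns are plausible reward
# hypotheses the agent cannot distinguish from its training data
# (a stand-in for reward-learning error). All values are illustrative.
import numpy as np

returns = np.array([
    [10.0, 10.0, -50.0],   # policy A: high return on two hypotheses, disastrous on one
    [ 6.0,  6.0,   6.0],   # policy B: mediocre but safe under every hypothesis
    [ 9.0,  4.0,   5.0],   # policy C: in between
])
prior = np.array([0.475, 0.475, 0.05])  # agent's belief over reward hypotheses

expected_value = returns @ prior                  # standard RL: maximize E[return]
pessimistic    = returns.min(axis=1)              # maximin: maximize worst-case return
regret         = returns.max(axis=0) - returns    # shortfall vs best policy per hypothesis
minimax_regret = regret.max(axis=1)               # minimize the worst-case shortfall

print("expected-value picks: policy", "ABC"[int(expected_value.argmax())])   # A
print("pessimistic picks:    policy", "ABC"[int(pessimistic.argmax())])      # B
print("minimax-regret picks: policy", "ABC"[int(minimax_regret.argmin())])   # B
```

In this example the expected-value objective selects the policy with a catastrophic outcome under one plausible reward hypothesis, while both robust objectives prefer the policy whose worst case is bounded; that is the kind of brittleness the agenda aims to remove.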

Broad Approach

engineering

Target Case

pessimistic

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Value is fragile and hard to specify, Superintelligence can fool human supervisors

Key People

Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen

Funding

Google DeepMind, University of Oxford, CMU, Coefficient Giving

Estimated FTEs: 20-70

Critiques

“The Era of Experience” has an unsolved technical alignment problem, The Invisible Leash: Why RLVR May or May Not Escape Its Origin

See Also

behavior-alignment-theory, assistance-games-assistive-agents, Goal robustness, Iterative alignment, mild-optimisation, scalable oversight, The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Outputs in 2025

11 items in the review. See the wiki/summaries/ entries with frontmatter agenda: rl-safety (generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; update by re-running the script rather than editing by hand.