RL safety — SR2025 Agenda Snapshot
One-sentence summary: This agenda aims to make reinforcement learning agents more robust by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Theory of Change
Standard RL objectives (such as maximizing expected return) are brittle and lead to goal misgeneralization or specification gaming; by developing more robust frameworks (such as pessimistic RL, minimax regret, or inverse reward learning with provable guarantees), we can build agents that remain safe even when the reward is misspecified.
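To make the contrast concrete, here is a minimal toy sketch (not taken from the agenda itself; the three-hypothesis reward ensemble, the actions, and the numbers are illustrative assumptions) showing how expected-value maximization, pessimistic (maximin) selection, and minimax-regret selection can disagree when the learned reward is uncertain:

```python
# Minimal toy sketch: action selection under reward misspecification.
# Everything here (the hypothesis ensemble, the numbers, the one-step setting)
# is an illustrative assumption, not the agenda's actual method.
import numpy as np

# Rows = candidate reward hypotheses (e.g. samples from a learned reward model),
# columns = available actions; entry [i, j] = reward hypothesis i assigns to action j.
reward_hypotheses = np.array([
    [ 2.0, 0.9, 0.2],   # hypothesis A: action 0 looks great
    [ 2.0, 0.8, 0.3],   # hypothesis B: agrees with A
    [-1.0, 0.7, 0.4],   # hypothesis C: action 0 is actually harmful
])

# 1. Standard objective: maximize expected reward across hypotheses.
expected_value_action = int(np.argmax(reward_hypotheses.mean(axis=0)))

# 2. Pessimistic RL (maximin): maximize worst-case reward over hypotheses.
pessimistic_action = int(np.argmax(reward_hypotheses.min(axis=0)))

# 3. Minimax regret: regret = (best achievable reward under a hypothesis)
#    minus (reward of the chosen action); pick the action whose worst-case
#    regret over hypotheses is smallest.
regret = reward_hypotheses.max(axis=1, keepdims=True) - reward_hypotheses
minimax_regret_action = int(np.argmin(regret.max(axis=0)))

print("expected value picks action", expected_value_action)   # 0: highest mean, but risky
print("pessimistic picks action   ", pessimistic_action)      # 1: acceptable under every hypothesis
print("minimax regret picks action", minimax_regret_action)   # 1: bounded regret under every hypothesis
```

In this toy setting the expected-value objective picks the action that one plausible reward hypothesis rates as harmful, while both robust criteria fall back to the action that is acceptable under every hypothesis; the agenda's actual proposals operate on full sequential RL problems rather than this one-step caricature.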
Broad Approach
engineering
Target Case
pessimistic
Orthodox Problems Addressed
Goals misgeneralize out of distribution, Value is fragile and hard to specify, Superintelligence can fool human supervisors
Key People
Joar Skalse, Karim Abdel Sadek, Matthew Farrugia-Roberts, Benjamin Plaut, Fang Wu, Stephen Zhao, Alessandro Abate, Steven Byrnes, Michael Cohen
Funding
Google DeepMind, University of Oxford, CMU, Coefficient Giving
Estimated FTEs: 20-70
Critiques
“The Era of Experience” has an unsolved technical alignment problem, The Invisible Leash: Why RLVR May or May Not Escape Its Origin
See Also
behavior-alignment-theory, assistance-games-assistive-agents, Goal robustness, Iterative alignment, mild-optimisation, scalable oversight, The Theoretical Reward Learning Research Agenda: Introduction and Motivation
Outputs in 2025
11 items in the review. See the wiki/summaries/ entries with frontmatter agenda: rl-safety (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = RL safety) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- assistance-games-assistive-agents
- behavior-alignment-theory
- mild-optimisation
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-filtering
- data-poisoning-defense
- data-quality-for-alignment
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]