Safe Learning Under Irreversible Dynamics via Asking for Help
Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell — 2025-02-19 — UC Berkeley — arXiv
Summary
Presents an algorithm with formal regret guarantees for safe reinforcement learning in environments with irreversible dynamics: the agent asks a mentor for help in unfamiliar states and transfers knowledge between similar states, achieving regret and a number of mentor queries that are both sublinear in the time horizon.
Key Result
First formal proof that an agent can obtain high reward while becoming self-sufficient in an unknown, unbounded, high-stakes environment without resets, with both regret and mentor queries sublinear in the time horizon.
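To make the ask-for-help idea concrete, here is a minimal hypothetical sketch (not the paper's actual algorithm, and with an invented similarity threshold): the agent queries the mentor only in states far from any it has seen, and otherwise reuses the action learned for the nearest known state, so queries become rarer as it covers the state space.

```python
import math

class AskForHelpAgent:
    """Toy illustration of mentor queries gated by state similarity.

    `mentor` is any callable mapping a state (tuple of floats) to a safe
    action; `similarity_threshold` is a made-up parameter, not from the paper.
    """

    def __init__(self, mentor, similarity_threshold=1.0):
        self.mentor = mentor
        self.threshold = similarity_threshold
        self.known = []        # list of (state, action) pairs seen so far
        self.num_queries = 0

    def _nearest(self, state):
        # Linear scan for the closest previously visited state.
        best_action, best_dist = None, math.inf
        for s, a in self.known:
            d = math.dist(s, state)
            if d < best_dist:
                best_action, best_dist = a, d
        return best_action, best_dist

    def act(self, state):
        action, dist = self._nearest(state)
        if dist <= self.threshold:
            # Familiar territory: transfer knowledge from a similar state.
            return action
        # Unfamiliar state: ask the mentor rather than risk an
        # irreversible mistake, and remember the answer.
        self.num_queries += 1
        action = self.mentor(state)
        self.known.append((state, action))
        return action
```

Under this toy rule, mentor queries stop once visited states cover the reachable region at the threshold's resolution, loosely mirroring the paper's sublinear-query behaviour.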
Source
- Link: https://arxiv.org/abs/2502.14043
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness