Safe Learning Under Irreversible Dynamics via Asking for Help
Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell — 2025-02-19 — UC Berkeley — arXiv
Summary
Presents an algorithm with formal regret guarantees for safe reinforcement learning in environments with irreversible dynamics: the agent asks a mentor for help in unfamiliar states and transfers knowledge between similar states, achieving regret and a number of mentor queries that are both sublinear in the time horizon.
Key Result
First formal proof that an agent can obtain high reward while becoming self-sufficient in an unknown, unbounded, high-stakes environment without resets, with both regret and mentor queries sublinear in the time horizon.
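To make the ask-for-help idea concrete, here is a minimal hypothetical sketch (not the paper's actual algorithm, and with an invented similarity threshold): the agent queries the mentor only in states far from any it has seen, and otherwise reuses the action learned for the nearest known state, so queries become rarer as it covers the state space.

```python
import math

class AskForHelpAgent:
    """Toy illustration of mentor queries gated by state similarity.

    `mentor` is any callable mapping a state (tuple of floats) to a safe
    action; `similarity_threshold` is a made-up parameter, not from the paper.
    """

    def __init__(self, mentor, similarity_threshold=1.0):
        self.mentor = mentor
        self.threshold = similarity_threshold
        self.known = []        # list of (state, action) pairs seen so far
        self.num_queries = 0

    def _nearest(self, state):
        # Linear scan for the closest previously visited state.
        best_action, best_dist = None, math.inf
        for s, a in self.known:
            d = math.dist(s, state)
            if d < best_dist:
                best_action, best_dist = a, d
        return best_action, best_dist

    def act(self, state):
        action, dist = self._nearest(state)
        if dist <= self.threshold:
            # Familiar territory: transfer knowledge from a similar state.
            return action
        # Unfamiliar state: ask the mentor rather than risk an
        # irreversible mistake, and remember the answer.
        self.num_queries += 1
        action = self.mentor(state)
        self.known.append((state, action))
        return action
```

Under this toy rule, mentor queries stop once visited states cover the reachable region at the threshold's resolution, loosely mirroring the paper's sublinear-query behaviour.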
Source
- Link: https://arxiv.org/abs/2502.14043
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness