Interpreting Emergent Planning in Model-Free Reinforcement Learning
Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger — 2025-04-02
Source
- Link: https://arxiv.org/pdf/2504.01871
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- reverse-engineering — White-box safety (i.e. Interpretability)