Safety cases for Pessimism
Michael Cohen — LessWrong
Summary
Proposes pessimistic RL as a unified approach to alignment challenges including truthfulness, feedback tampering, and the ELK problem: an adversary learns plausible world-models that assign pessimistically low rewards, and the agent trains in this adversarial environment.
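The core loop described above can be sketched as worst-case value estimation over a set of plausible world-models. This is a minimal illustrative sketch, not the post's actual construction; the model set, reward values, and function names are hypothetical.

```python
# Sketch of pessimistic action selection (hypothetical names and values).
# An adversary holds a set of "plausible" world-models, each mapping an
# action to a predicted reward. The agent values an action at the minimum
# (worst-case) reward over that set and acts to maximize this value.

def pessimistic_value(action, plausible_models):
    """Adversary picks the plausible model assigning the lowest reward."""
    return min(model(action) for model in plausible_models)

def choose_action(actions, plausible_models):
    """Agent maximizes the pessimistic (adversarial) value."""
    return max(actions, key=lambda a: pessimistic_value(a, plausible_models))

# Toy example: two plausible world-models disagree about a risky action,
# so the pessimistic agent prefers the action with the better worst case.
models = [
    lambda a: {"safe": 0.5, "risky": 1.0}[a],   # model where risk pays off
    lambda a: {"safe": 0.5, "risky": -1.0}[a],  # model where risk backfires
]
print(choose_action(["safe", "risky"], models))  # prints "safe"
```

The pessimism comes entirely from the `min` over plausible models: any action whose reward is uncertain across the model set is valued by its worst plausible outcome.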
Source
- Link: https://lesswrong.com/posts/CpftMXCEnwqbWreHD/safety-cases-for-pessimism
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness