Safety cases for Pessimism

Michael Cohen — LessWrong

Summary

Proposes pessimistic reinforcement learning (RL) as a unified approach to several alignment challenges, including truthfulness, feedback tampering, and the eliciting latent knowledge (ELK) problem. An adversary learns plausible world-models that pessimistically assign low rewards, and the agent is trained in the resulting adversarial environment.
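As a rough illustration of the idea in the summary, the sketch below scores each action by its worst-case reward across a set of plausible world-models and then picks the action with the best worst case. This is a toy stand-in, not the post's algorithm: the dictionary-based `WorldModel`, the posterior `threshold` used to define "plausible", the action names, and all reward numbers are assumptions made up for this example.

```python
from typing import Dict, List

# Hypothetical toy representation: a world-model maps each action
# to the reward it predicts for that action.
WorldModel = Dict[str, float]


def plausible_models(
    models: List[WorldModel], posterior: List[float], threshold: float
) -> List[WorldModel]:
    """Keep the models whose posterior weight is at least `threshold`.

    The threshold rule is an assumption of this sketch; it stands in for
    whatever notion of plausibility the adversary is restricted to.
    """
    return [m for m, w in zip(models, posterior) if w >= threshold]


def pessimistic_value(action: str, models: List[WorldModel]) -> float:
    """Worst-case predicted reward for `action` over the plausible models."""
    return min(m[action] for m in models)


def pessimistic_action(actions: List[str], models: List[WorldModel]) -> str:
    """Choose the action whose worst-case predicted reward is highest."""
    return max(actions, key=lambda a: pessimistic_value(a, models))


# Illustrative numbers only.
actions = ["report_honestly", "tamper_with_feedback"]
models = [
    # Model in which tampered feedback still counts as reward.
    {"report_honestly": 0.7, "tamper_with_feedback": 1.0},
    # Model in which reward tracks the true outcome, so tampering earns nothing.
    {"report_honestly": 0.7, "tamper_with_feedback": 0.0},
]
posterior = [0.6, 0.4]

kept = plausible_models(models, posterior, threshold=0.1)
choice = pessimistic_action(actions, kept)
```

In this toy case both models stay plausible, so tampering is scored by the model in which it earns nothing, and the pessimistic agent prefers honest reporting even though the more probable model favors tampering.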

Source