Safety cases for Pessimism
Michael Cohen — LessWrong
Summary
Proposes pessimistic RL as a unified approach to alignment challenges including truthfulness, feedback tampering, and the ELK problem: an adversary learns plausible world-models that assign pessimistically low rewards, and the agent trains in this adversarial environment.
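The core loop described above can be sketched as worst-case value estimation over a set of plausible world-models. This is a minimal illustrative sketch, not the post's actual construction; the model set, reward values, and function names are hypothetical.

```python
# Sketch of pessimistic action selection (hypothetical names and values).
# An adversary holds a set of "plausible" world-models, each mapping an
# action to a predicted reward. The agent values an action at the minimum
# (worst-case) reward over that set and acts to maximize this value.

def pessimistic_value(action, plausible_models):
    """Adversary picks the plausible model assigning the lowest reward."""
    return min(model(action) for model in plausible_models)

def choose_action(actions, plausible_models):
    """Agent maximizes the pessimistic (adversarial) value."""
    return max(actions, key=lambda a: pessimistic_value(a, plausible_models))

# Toy example: two plausible world-models disagree about a risky action,
# so the pessimistic agent prefers the action with the better worst case.
models = [
    lambda a: {"safe": 0.5, "risky": 1.0}[a],   # model where risk pays off
    lambda a: {"safe": 0.5, "risky": -1.0}[a],  # model where risk backfires
]
print(choose_action(["safe", "risky"], models))  # prints "safe"
```

The pessimism comes entirely from the `min` over plausible models: any action whose reward is uncertain across the model set is valued by its worst plausible outcome.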
Source
- Link: https://lesswrong.com/posts/CpftMXCEnwqbWreHD/safety-cases-for-pessimism
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness