RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac — 2025-01-15 — Princeton University — arXiv

Summary

Presents Reinforcement Learning from Hindsight Simulation (RLHS), which mitigates systematic misalignment in RLHF by conditioning evaluator feedback on simulated downstream outcomes rather than on the evaluator's foresight predictions of utility. Because the reward is grounded in what the user actually experiences after acting on the model's advice, this avoids the Goodhart's-law dynamics in which models learn to sound helpful rather than be helpful.
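
A minimal sketch of the foresight-vs-hindsight feedback difference. All names here (policy_response, simulate_outcome, and the two reward functions) are toy stand-ins for illustration, not the paper's implementation; in RLHS the outcome simulation and evaluation are performed by LLMs.

```python
import random

def policy_response(prompt: str) -> str:
    """Toy chatbot policy: returns a confident-sounding recommendation."""
    return f"You should buy item A for: {prompt}"

def simulate_outcome(prompt: str, response: str) -> float:
    """Stand-in for a hindsight rollout: the downstream utility the user
    realizes after acting on the response (random here; an LLM-simulated
    consequence in the paper)."""
    return random.uniform(-1.0, 1.0)

def foresight_reward(prompt: str, response: str) -> float:
    """RLHF-style feedback: the evaluator rates the response at interaction
    time, before any outcome is observed. This rewards responses that merely
    sound helpful, which is the Goodhart failure mode."""
    return 1.0 if "You should" in response else 0.0

def hindsight_reward(prompt: str, response: str) -> float:
    """RLHS-style feedback: the evaluator rates the response only after
    conditioning on the simulated downstream outcome."""
    return simulate_outcome(prompt, response)

if __name__ == "__main__":
    prompt = "a laptop for video editing"
    response = policy_response(prompt)
    print("foresight (RLHF) reward:", foresight_reward(prompt, response))
    print("hindsight (RLHS) reward:", hindsight_reward(prompt, response))
```

The key design point is where the evaluation signal enters the loop: RLHF scores the response itself, while RLHS scores it through a simulated consequence, decoupling the evaluator's judgment from its (often miscalibrated) predictions.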

Key Result

RLHS substantially improves alignment over RLHF in three consultancy settings, whether trained with PPO or DPO, and the gains carry over to the TruthfulQA, HaluEval, and TrustLLM benchmarks.

Source