RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac — 2025-01-15 — Princeton University — arXiv
Summary
Presents Reinforcement Learning from Hindsight Simulation (RLHS), which mitigates systematic misalignment in RLHF by conditioning evaluator feedback on simulated downstream outcomes rather than on foresight predictions of usefulness, preventing the Goodhart's law dynamics that arise when a model learns to optimize the evaluator's predictions instead of actual outcomes.
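The core change is where the evaluator's rating enters the loop. Below is a minimal sketch of the contrast, assuming hypothetical `simulate_outcome` and `rate` helpers (in practice these would be LLM calls or human judgments); this is an illustration of the idea, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Interaction:
    prompt: str
    response: str

def foresight_feedback(ix: Interaction, rate: Callable[[str], float]) -> float:
    # Standard RLHF: the evaluator rates the response immediately,
    # predicting its usefulness before any consequences are observed.
    return rate(f"Prompt: {ix.prompt}\nResponse: {ix.response}")

def hindsight_feedback(
    ix: Interaction,
    simulate_outcome: Callable[[Interaction], str],
    rate: Callable[[str], float],
) -> float:
    # RLHS-style feedback: first simulate the downstream outcome of
    # acting on the response, then rate the response *given* that outcome.
    outcome = simulate_outcome(ix)
    return rate(
        f"Prompt: {ix.prompt}\nResponse: {ix.response}\n"
        f"Observed outcome: {outcome}"
    )

# Toy stand-ins for the hypothetical helpers:
ix = Interaction("Which laptop should I buy?", "Model X has 12h battery life.")
reward = hindsight_feedback(
    ix,
    simulate_outcome=lambda i: "Buyer chose Model X; battery lasted 5h.",
    rate=lambda transcript: 0.0 if "5h" in transcript else 1.0,
)
```

The hindsight score then slots into the usual pipeline, as the reward signal for PPO or as labels for DPO preference pairs, exactly where the foresight score would otherwise go.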
Key Result
RLHS substantially improves alignment over RLHF across three consultancy settings, with both PPO and DPO, and the improvements persist on the TruthfulQA, HaluEval, and TrustLLM benchmarks.
Source
- Link: https://arxiv.org/abs/2501.08617
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment