RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac — 2025-01-15 — Princeton University — arXiv
Summary
Presents Reinforcement Learning from Hindsight Simulation (RLHS), which mitigates systematic misalignment in RLHF by conditioning evaluator feedback on simulated downstream outcomes rather than on foresight predictions of usefulness, preventing the Goodhart's law dynamics that arise when a model learns to optimize the evaluator's predictions instead of actual outcomes.
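The core change is where the evaluator's rating enters the loop. Below is a minimal sketch of the contrast, assuming hypothetical `simulate_outcome` and `rate` helpers (in practice these would be LLM calls or human judgments); this is an illustration of the idea, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Interaction:
    prompt: str
    response: str

def foresight_feedback(ix: Interaction, rate: Callable[[str], float]) -> float:
    # Standard RLHF: the evaluator rates the response immediately,
    # predicting its usefulness before any consequences are observed.
    return rate(f"Prompt: {ix.prompt}\nResponse: {ix.response}")

def hindsight_feedback(
    ix: Interaction,
    simulate_outcome: Callable[[Interaction], str],
    rate: Callable[[str], float],
) -> float:
    # RLHS-style feedback: first simulate the downstream outcome of
    # acting on the response, then rate the response *given* that outcome.
    outcome = simulate_outcome(ix)
    return rate(
        f"Prompt: {ix.prompt}\nResponse: {ix.response}\n"
        f"Observed outcome: {outcome}"
    )

# Toy stand-ins for the hypothetical helpers:
ix = Interaction("Which laptop should I buy?", "Model X has 12h battery life.")
reward = hindsight_feedback(
    ix,
    simulate_outcome=lambda i: "Buyer chose Model X; battery lasted 5h.",
    rate=lambda transcript: 0.0 if "5h" in transcript else 1.0,
)
```

The hindsight score then slots into the usual pipeline, as the reward signal for PPO or as labels for DPO preference pairs, exactly where the foresight score would otherwise go.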
Key Result
RLHS substantially improves alignment over RLHF across three consultancy settings, with both PPO and DPO, and the improvements persist on the TruthfulQA, HaluEval, and TrustLLM benchmarks.
Source
- Link: https://arxiv.org/abs/2501.08617
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment