Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon — 2025-06-24 — arXiv
Summary
Characterizes reward hacking in inference-time alignment methods such as Best-of-n and Soft-Best-of-n, introduces Best-of-Poisson, and presents the HedgeTune algorithm, which mitigates reward hacking by hedging, i.e., deliberately limiting how hard the sampler optimizes the proxy reward signal.
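To make the two baseline mechanisms concrete, here is a minimal sketch of Best-of-n and Soft-Best-of-n selection. `candidates` and `proxy_reward` are hypothetical stand-ins for model samples and a learned proxy reward model, and the temperature parameterization is a common convention rather than the paper's exact formulation.

```python
# Minimal sketch of Best-of-n (BoN) and Soft-Best-of-n (SBoN) selection.
# `candidates` and `proxy_reward` are hypothetical stand-ins; nothing here
# is taken from the paper's code.
import math
import random

def best_of_n(candidates, proxy_reward):
    """BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)

def soft_best_of_n(candidates, proxy_reward, temperature=1.0):
    """SBoN: sample a candidate with probability proportional to
    exp(proxy_reward / temperature). Low temperature approaches BoN;
    high temperature hedges toward uniform sampling."""
    scores = [proxy_reward(c) / temperature for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy usage: candidates are strings, the proxy reward is their length.
if __name__ == "__main__":
    cands = ["short", "a medium reply", "a much longer candidate reply"]
    print(best_of_n(cands, len))
    print(soft_best_of_n(cands, len, temperature=5.0))
```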
Key Result
Hedging mitigates reward hacking and achieves superior distortion-reward tradeoffs; the characteristic hacking pattern, in which true reward first rises and then declines as optimization pressure on the proxy increases, is shown to be inevitable for a broad class of inference-time mechanisms.
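To illustrate the rise-then-fall pattern and why hedging helps, here is a toy Monte Carlo sweep over n. The Gaussian true reward with heavy-tailed proxy noise is an assumption chosen to make proxy over-optimization visible, and the naive grid sweep is not the paper's HedgeTune algorithm, which tunes the inference-time parameter directly.

```python
# Toy Monte Carlo illustration of the rise-then-fall hacking curve and of
# hedging. The reward model below (Gaussian true reward, heavy-tailed proxy
# noise) is an assumption, not the paper's setup; the grid sweep is a naive
# stand-in for HedgeTune.
import numpy as np

rng = np.random.default_rng(0)

def bon_true_reward(n: int, trials: int = 20_000) -> float:
    """Average *true* reward of the Best-of-n winner selected by the proxy."""
    true = rng.normal(0.0, 1.0, size=(trials, n))
    noise = rng.standard_t(df=3, size=(trials, n))  # heavy-tailed errors
    proxy = true + noise                            # exploitable proxy reward
    winners = proxy.argmax(axis=1)                  # BoN picks the max proxy
    return float(true[np.arange(trials), winners].mean())

if __name__ == "__main__":
    curve = {n: bon_true_reward(n) for n in (1, 2, 4, 8, 16, 64, 256)}
    for n, r in curve.items():
        print(f"n={n:>3}  avg true reward = {r:+.3f}")
    # Hedging: stop at the n where estimated true reward peaks instead of
    # pushing n (and hence proxy optimization pressure) as high as possible.
    n_star = max(curve, key=curve.get)
    print(f"hedged choice of n: {n_star}")
```

Under these assumptions the estimated true reward typically peaks at a moderate n and decays for larger n, which is the behavior hedging exploits.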
Source
- Link: https://arxiv.org/abs/2506.19248
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- inference-time-in-context-learning — Black-box safety (understand and control current model behaviour) / Iterative alignment