Inference-Time Reward Hacking in Large Language Models

Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon — 2025-06-24 — arXiv

Summary

Characterizes reward hacking in inference-time alignment methods such as Best-of-n and Soft-Best-of-n, and introduces Best-of-Poisson sampling together with the HedgeTune algorithm, which mitigate reward hacking by hedging against an imperfect proxy reward signal.
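The two baseline selection rules can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a scalar `proxy_reward` function over candidate completions, and the function names and temperature parameter are illustrative.

```python
import math
import random

def best_of_n(candidates, proxy_reward):
    # Best-of-n: return the candidate with the highest proxy reward.
    return max(candidates, key=proxy_reward)

def soft_best_of_n(candidates, proxy_reward, temperature=1.0):
    # Soft-Best-of-n: sample a candidate with probability proportional to
    # exp(reward / temperature). As temperature -> 0 this approaches
    # Best-of-n; higher temperatures hedge against an imperfect proxy.
    scores = [proxy_reward(c) / temperature for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]
```

Softening the selection is one way to hedge: it trades a little proxy reward for robustness when the proxy diverges from the true objective.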

Key Result

Hedging mitigates reward hacking and achieves superior distortion-reward tradeoffs; the characteristic hacking pattern, in which the true reward first increases and then declines as optimization pressure grows, is shown to be an inevitable property of these inference-time mechanisms.
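The rise-then-fall pattern can be reproduced in a toy simulation (not from the paper): candidates are Gaussian scores, the proxy reward is the score itself, and an assumed quadratic true reward peaks at a score of 1 and declines for over-optimized candidates. As n grows, Best-of-n on the proxy first helps and then hurts the true reward.

```python
import random

def avg_true_reward(n, trials=2000, seed=0):
    # Toy model (illustrative assumption): proxy reward is the candidate
    # score x, while the true reward x - 0.5*x**2 peaks at x = 1 and
    # declines for candidates that over-optimize the proxy.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Best-of-n selection on the proxy: pick the largest score.
        best = max(rng.gauss(0.0, 1.0) for _ in range(n))
        total += best - 0.5 * best * best  # true reward of the selection
    return total / trials
```

With this setup, moderate n improves the true reward over n = 1, while very large n drives the selected score past the true-reward peak and the average true reward collapses.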

Source