Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon — 2025-06-24 — arXiv
Summary
Characterizes reward hacking in inference-time alignment methods such as Best-of-n and Soft-Best-of-n, introduces Best-of-Poisson, and presents the HedgeTune algorithm, which mitigates reward hacking by hedging, i.e., deliberately limiting how hard the sampler optimizes the proxy reward signal.
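To make the two baseline mechanisms concrete, here is a minimal sketch of Best-of-n and Soft-Best-of-n selection. `candidates` and `proxy_reward` are hypothetical stand-ins for model samples and a learned proxy reward model, and the temperature parameterization is a common convention rather than the paper's exact formulation.

```python
# Minimal sketch of Best-of-n (BoN) and Soft-Best-of-n (SBoN) selection.
# `candidates` and `proxy_reward` are hypothetical stand-ins; nothing here
# is taken from the paper's code.
import math
import random

def best_of_n(candidates, proxy_reward):
    """BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)

def soft_best_of_n(candidates, proxy_reward, temperature=1.0):
    """SBoN: sample a candidate with probability proportional to
    exp(proxy_reward / temperature). Low temperature approaches BoN;
    high temperature hedges toward uniform sampling."""
    scores = [proxy_reward(c) / temperature for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy usage: candidates are strings, the proxy reward is their length.
if __name__ == "__main__":
    cands = ["short", "a medium reply", "a much longer candidate reply"]
    print(best_of_n(cands, len))
    print(soft_best_of_n(cands, len, temperature=5.0))
```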
Key Result
Hedging mitigates reward hacking and achieves superior distortion-reward tradeoffs; the characteristic hacking pattern, in which true reward first rises and then declines as optimization pressure on the proxy increases, is shown to be inevitable for a broad class of inference-time mechanisms.
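To illustrate the rise-then-fall pattern and why hedging helps, here is a toy Monte Carlo sweep over n. The Gaussian true reward with heavy-tailed proxy noise is an assumption chosen to make proxy over-optimization visible, and the naive grid sweep is not the paper's HedgeTune algorithm, which tunes the inference-time parameter directly.

```python
# Toy Monte Carlo illustration of the rise-then-fall hacking curve and of
# hedging. The reward model below (Gaussian true reward, heavy-tailed proxy
# noise) is an assumption, not the paper's setup; the grid sweep is a naive
# stand-in for HedgeTune.
import numpy as np

rng = np.random.default_rng(0)

def bon_true_reward(n: int, trials: int = 20_000) -> float:
    """Average *true* reward of the Best-of-n winner selected by the proxy."""
    true = rng.normal(0.0, 1.0, size=(trials, n))
    noise = rng.standard_t(df=3, size=(trials, n))  # heavy-tailed errors
    proxy = true + noise                            # exploitable proxy reward
    winners = proxy.argmax(axis=1)                  # BoN picks the max proxy
    return float(true[np.arange(trials), winners].mean())

if __name__ == "__main__":
    curve = {n: bon_true_reward(n) for n in (1, 2, 4, 8, 16, 64, 256)}
    for n, r in curve.items():
        print(f"n={n:>3}  avg true reward = {r:+.3f}")
    # Hedging: stop at the n where estimated true reward peaks instead of
    # pushing n (and hence proxy optimization pressure) as high as possible.
    n_star = max(curve, key=curve.get)
    print(f"hedged choice of n: {n_star}")
```

Under these assumptions the estimated true reward typically peaks at a moderate n and decays for larger n, which is the behavior hedging exploits.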
Source
- Link: https://arxiv.org/abs/2506.19248
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- inference-time-in-context-learning — Black-box safety (understand and control current model behaviour) / Iterative alignment