Murphy's Laws of AI Alignment: Why the Gap Always Wins
Madhava Gaikwad — 2025-09-04 — arXiv
Summary
Proves a formal impossibility theorem: when feedback is systematically biased on rare contexts, RLHF under misspecification requires exponentially many samples to distinguish the true reward function from alternatives, unless the learner can identify where feedback is unreliable.
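A hedged sketch of the information-theoretic intuition behind such a result (notation assumed here, not the paper's verbatim argument): two reward functions that agree everywhere except on the biased contexts induce feedback distributions that are nearly indistinguishable, because all of the signal separating them sits on exactly the contexts where feedback cannot be trusted.

```latex
% Hedged intuition sketch; notation assumed, not quoted from the paper.
% r and r' are reward functions agreeing outside the biased contexts,
% P_r is the feedback distribution induced by r, \alpha is the biased
% fraction of contexts, and \epsilon is the bias strength.
\[
  \mathrm{KL}\left(P_{r} \,\middle\|\, P_{r'}\right)
  \;\lesssim\; \alpha\,\epsilon^{2}
\]
% Each feedback sample therefore carries only O(\alpha\,\epsilon^2) signal
% about which reward is true, matching the inverse of the oracle-assisted
% query bound in the key result below.
```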
Key Result
Any learning algorithm needs exp(n·α·ε²) samples to overcome feedback bias on a fraction α of contexts with bias strength ε, but a calibration oracle reduces this to O(1/(α·ε²)) queries.
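To make the gap concrete, here is a small numeric illustration; the parameter values are hypothetical and chosen only for illustration, and the reading of n as a problem-size parameter is an assumption, not the paper's definition.

```python
import math

# Illustrative comparison of the two bounds in the key result. The values of
# alpha, eps, and n are hypothetical; the paper's constants and regime may differ.
alpha = 0.01     # fraction of contexts with systematically biased feedback
eps = 0.1        # bias strength epsilon on those contexts
n = 1_000_000    # problem-size parameter in the exponential lower bound

# Without a calibration oracle: sample complexity grows like exp(n*alpha*eps^2).
log_samples = n * alpha * eps**2            # natural log of the lower bound

# With a calibration oracle: O(1/(alpha*eps^2)) queries suffice.
oracle_queries = 1 / (alpha * eps**2)

print(f"without oracle: ~exp({log_samples:.0f}) "
      f"~ 10^{log_samples / math.log(10):.0f} samples")
print(f"with oracle:    ~{oracle_queries:,.0f} queries")
```

With these toy values the unassisted bound is astronomically large (about 10^43 samples) while the oracle-assisted bound is a modest 10,000 queries, which is the qualitative point of the theorem.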
Source
- Link: https://arxiv.org/abs/2509.05381
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- assistance-games-assistive-agents — Black-box safety (understand and control current model behaviour) / Goal robustness