Murphy's Laws of AI Alignment: Why the Gap Always Wins
Madhava Gaikwad — 2025-09-04 — arXiv
Summary
Proves a formal impossibility theorem: when feedback is systematically biased on rare contexts, RLHF under misspecification requires exponentially many samples to distinguish the true reward function from alternatives, unless the learner can identify where feedback is unreliable.
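A hedged sketch of the information-theoretic intuition behind such a result (notation assumed here, not the paper's verbatim argument): two reward functions that agree everywhere except on the biased contexts induce feedback distributions that are nearly indistinguishable, because all of the signal separating them sits on exactly the contexts where feedback cannot be trusted.

```latex
% Hedged intuition sketch; notation assumed, not quoted from the paper.
% r and r' are reward functions agreeing outside the biased contexts,
% P_r is the feedback distribution induced by r, \alpha is the biased
% fraction of contexts, and \epsilon is the bias strength.
\[
  \mathrm{KL}\left(P_{r} \,\middle\|\, P_{r'}\right)
  \;\lesssim\; \alpha\,\epsilon^{2}
\]
% Each feedback sample therefore carries only O(\alpha\,\epsilon^2) signal
% about which reward is true, matching the inverse of the oracle-assisted
% query bound in the key result below.
```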
Key Result
Any learning algorithm needs exp(n·α·ε²) samples to overcome feedback bias on a fraction α of contexts with bias strength ε, but a calibration oracle reduces this to O(1/(α·ε²)) queries.
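To make the gap concrete, here is a small numeric illustration; the parameter values are hypothetical and chosen only for illustration, and the reading of n as a problem-size parameter is an assumption, not the paper's definition.

```python
import math

# Illustrative comparison of the two bounds in the key result. The values of
# alpha, eps, and n are hypothetical; the paper's constants and regime may differ.
alpha = 0.01     # fraction of contexts with systematically biased feedback
eps = 0.1        # bias strength epsilon on those contexts
n = 1_000_000    # problem-size parameter in the exponential lower bound

# Without a calibration oracle: sample complexity grows like exp(n*alpha*eps^2).
log_samples = n * alpha * eps**2            # natural log of the lower bound

# With a calibration oracle: O(1/(alpha*eps^2)) queries suffice.
oracle_queries = 1 / (alpha * eps**2)

print(f"without oracle: ~exp({log_samples:.0f}) "
      f"~ 10^{log_samples / math.log(10):.0f} samples")
print(f"with oracle:    ~{oracle_queries:,.0f} queries")
```

With these toy values the unassisted bound is astronomically large (about 10^43 samples) while the oracle-assisted bound is a modest 10,000 queries, which is the qualitative point of the theorem.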
Source
- Link: https://arxiv.org/abs/2509.05381
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- assistance-games-assistive-agents — Black-box safety (understand and control current model behaviour) / Goal robustness