Murphy's Laws of AI Alignment: Why the Gap Always Wins

Madhava Gaikwad — 2025-09-04 — arXiv

Summary

Proves a formal impossibility theorem for RLHF under reward misspecification: when feedback is systematically biased on rare contexts, any learner needs exponentially many samples to distinguish the true reward function, unless it can identify where the feedback is unreliable.

Key Result

Any learning algorithm needs exp(n·α·ε²) samples to overcome feedback bias on a fraction α of contexts with bias strength ε, but a calibration oracle reduces this to O(1/(α·ε²)) queries.
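A minimal numeric sketch of the gap between the two bounds as stated in the summary (the parameter values and variable names here are illustrative assumptions, not from the paper):

```python
import math

# alpha = fraction of contexts with systematically biased feedback
# eps   = bias strength on those contexts
alpha, eps = 0.05, 0.2

# With a calibration oracle: O(1/(alpha * eps^2)) queries suffice.
oracle_queries = 1.0 / (alpha * eps ** 2)  # 500 for these values

# Without one: the sample requirement grows like exp(n * alpha * eps^2),
# so the cost of overcoming the bias explodes as n grows.
for n in (100, 1000, 10000):
    lower_bound = math.exp(n * alpha * eps ** 2)
    print(f"n={n}: samples ~ exp({n * alpha * eps**2:.1f}) = {lower_bound:.3g}; "
          f"oracle queries ~ {oracle_queries:.0f}")
```

Even at these mild settings the exponential term dwarfs the oracle-assisted bound once n is in the thousands, which is the sense in which "the gap always wins" without calibration.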

Source