Misalignment From Treating Means as Ends

Henrik Marklund, Alex Infanger, Benjamin Van Roy — 2025-07-15 — arXiv

Summary

Formulates a theoretical example demonstrating how even slight conflation of instrumental and terminal goals in a reward function results in severe misalignment, and analyzes which environmental properties make reinforcement learning highly sensitive to this conflation.

Key Result

Even slight conflation of instrumental and terminal goals in the reward specification results in severe misalignment: a policy that optimizes the misspecified reward function performs poorly when evaluated under the true reward function.
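
To make the claim concrete, here is a minimal sketch of how a small reward on an instrumental action can flip the optimal policy. It is a hypothetical two-state toy problem and is not the example constructed in the paper. The state and action names, the discount factor GAMMA = 0.99, and the bonus sizes are all assumptions chosen for illustration. The true reward pays 1 only for finishing the task. The misspecified reward adds a small bonus eps each time the agent takes the instrumental "gather" action.

```python
# Toy deterministic MDP (hypothetical; not the paper's construction).
# States:  0 = working, 1 = done (absorbing).
# Actions: 0 = gather (instrumental, stays in state 0), 1 = finish (terminal goal).
GAMMA = 0.99
NEXT_STATE = [[0, 1], [1, 1]]            # NEXT_STATE[s][a]
TRUE_REWARD = [[0.0, 1.0], [0.0, 0.0]]   # only finishing has terminal value


def proxy_reward(eps):
    """True reward plus a small bonus eps for the instrumental 'gather' action."""
    reward = [row[:] for row in TRUE_REWARD]
    reward[0][0] += eps
    return reward


def greedy_policy(reward, gamma=GAMMA, iters=5000):
    """Run value iteration, then return the greedy action for each state."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [
            max(reward[s][a] + gamma * v[NEXT_STATE[s][a]] for a in (0, 1))
            for s in (0, 1)
        ]
    return [
        max((0, 1), key=lambda a: reward[s][a] + gamma * v[NEXT_STATE[s][a]])
        for s in (0, 1)
    ]


def true_return(policy, gamma=GAMMA, horizon=5000):
    """Discounted return under TRUE_REWARD when following policy from state 0."""
    state, total, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]
        total += discount * TRUE_REWARD[state][action]
        discount *= gamma
        state = NEXT_STATE[state][action]
    return total


if __name__ == "__main__":
    for eps in (0.0, 0.005, 0.02):
        policy = greedy_policy(proxy_reward(eps))
        print(f"eps={eps}: action in state 0 = {policy[0]}, "
              f"true return = {true_return(policy)}")
```

In this toy setting, gathering forever is worth eps / (1 - GAMMA) under the misspecified reward, so once eps exceeds 1 - GAMMA = 0.01 the optimizer never finishes and its true return drops from 1 to 0. A bonus of 0.02, which is 2% of the terminal reward, is enough. How sensitive the outcome is depends on the discount factor and on whether the instrumental action can be repeated, which are properties of this sketch rather than results taken from the paper.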

Source