Misalignment From Treating Means as Ends

Henrik Marklund, Alex Infanger, Benjamin Van Roy — 2025-07-15 — arXiv

Summary

Formulates a theoretical example demonstrating how even slight conflation of instrumental and terminal goals in reward functions results in severe misalignment, and analyzes what environmental properties make reinforcement learning highly sensitive to this conflation.

Key Result

Even slight conflation of instrumental and terminal goals in the reward specification results in severe misalignment: optimizing the misspecified reward function yields poor performance when measured by the true reward function.
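To make the claim concrete, here is a minimal sketch of a hypothetical toy problem. It is not the example from the paper, and every name and number in it (N, GAMMA, EPS, the "hold"/"spend" actions) is an assumption chosen for illustration. An agent must work for N steps to acquire a resource, and the only true reward comes from spending it. The proxy reward adds a small bonus EPS for merely holding the resource, which treats a means as an end. Value iteration is used as a stand-in for "optimizing the reward function".

```python
# Hypothetical toy MDP; NOT the construction used in the paper.
N = 20        # assumed number of work steps needed to (re)acquire the resource
GAMMA = 0.99  # assumed discount factor
EPS = 0.1     # assumed small bonus the proxy pays for holding the resource
HOLD, SPEND = 0, 1
ACTIONS = (HOLD, SPEND)


def step(s, a):
    """Deterministic dynamics: returns (next_state, true_reward)."""
    if s < N:            # still working; the action has no effect
        return s + 1, 0.0
    if a == SPEND:       # terminal goal: spending the resource pays 1
        return 0, 1.0
    return N, 0.0        # holding the resource pays nothing under the true reward


def true_reward(s, a):
    return step(s, a)[1]


def proxy_reward(s, a):
    # Misspecified reward: the true reward plus a bonus for the instrumental state.
    return true_reward(s, a) + (EPS if s == N else 0.0)


def greedy_policy(reward, iters=5000):
    """Value iteration, then the greedy policy with respect to `reward`."""
    def q(v, s, a):
        return reward(s, a) + GAMMA * v[step(s, a)[0]]

    v = [0.0] * (N + 1)
    for _ in range(iters):
        v = [max(q(v, s, a) for a in ACTIONS) for s in range(N + 1)]
    return [max(ACTIONS, key=lambda a: q(v, s, a)) for s in range(N + 1)]


def true_return(policy, start=N, horizon=5000):
    """Discounted return of `policy`, measured by the TRUE reward."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, policy[s])
        total += discount * r
        discount *= GAMMA
    return total


pi_true = greedy_policy(true_reward)
pi_proxy = greedy_policy(proxy_reward)
print("action when holding the resource (true, proxy):", pi_true[N], pi_proxy[N])
print("true return (true-optimal, proxy-optimal):",
      true_return(pi_true), true_return(pi_proxy))
```

In this sketch the proxy-optimal policy holds the resource forever and earns zero true reward, while the true-optimal policy spends it and keeps earning. How small EPS can be before this flip happens depends on N and GAMMA, which is one way to read the summary's point that environmental properties govern sensitivity to the conflation.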

Source

Link: https://arxiv.org/abs/2507.10995
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness
