Is alignment reducible to becoming more coherent?
Cole Wyeth — 2025-04-22 — LessWrong
Summary
Proposes framing alignment as rationality enhancement: an observation-utility agent learns its utility function from preferences elicited through terminal interaction with the user, treating both manipulation of the user and changes in the user's values as forms of incoherence to be resolved rather than as fundamental obstacles.
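
To make the framing concrete, below is a minimal toy sketch (my own illustration, not code from the post): an agent keeps weights over candidate utility functions, elicits pairwise preferences at a terminal, and treats an answer that would create a preference cycle as incoherence to be resolved by re-querying. The candidate set, update rule, and all names here are illustrative assumptions.

```python
# Hypothetical sketch of an observation-utility agent learning preferences
# at a terminal; preference cycles are flagged as incoherence to resolve,
# not treated as fatal errors. All details are illustrative assumptions.
from itertools import combinations

# Candidate utility functions over observations (here, simple strings).
CANDIDATES = {
    "brevity": lambda obs: -len(obs),   # prefers shorter outcomes
    "verbosity": lambda obs: len(obs),  # prefers longer outcomes
}
weights = {name: 1.0 for name in CANDIDATES}  # uniform prior over candidates


def expected_utility(obs: str) -> float:
    """Utility of an observation under the current candidate mixture."""
    total = sum(weights.values())
    return sum(w * CANDIDATES[n](obs) for n, w in weights.items()) / total


def update_on_preference(preferred: str, rejected: str, lr: float = 0.5) -> None:
    """Downweight candidates that disagree with the stated preference."""
    for name, u in CANDIDATES.items():
        if u(preferred) < u(rejected):
            weights[name] *= lr


def creates_cycle(graph: dict, preferred: str, rejected: str) -> bool:
    """Would recording `preferred > rejected` close a preference cycle?"""
    stack, visited = [rejected], set()
    while stack:  # DFS: can we already reach `preferred` from `rejected`?
        node = stack.pop()
        if node == preferred:
            return True
        if node not in visited:
            visited.add(node)
            stack.extend(graph.get(node, ()))
    return False


def elicit_preference(a: str, b: str) -> tuple:
    """Ask the user at the terminal which of two outcomes they prefer."""
    ans = input(f"Prefer (1) {a!r} or (2) {b!r}? ").strip()
    return (a, b) if ans == "1" else (b, a)


def run_session(outcomes: list) -> None:
    graph = {}  # maps each preferred outcome to the outcomes it beat
    for a, b in combinations(outcomes, 2):
        preferred, rejected = elicit_preference(a, b)
        # Incoherence handling: re-query instead of aborting.
        while creates_cycle(graph, preferred, rejected):
            print("Incoherent with earlier answers; please reconsider.")
            preferred, rejected = elicit_preference(a, b)
        graph.setdefault(preferred, []).append(rejected)
        update_on_preference(preferred, rejected)
    best = max(outcomes, key=expected_utility)
    print(f"Best outcome under the learned utility: {best!r}")


if __name__ == "__main__":
    run_session(["ok", "okay", "okay then"])
```

In this toy, "becoming more coherent" corresponds to eliminating preference cycles before committing an update, which is one way to read the post's claim that manipulation and user change are kinds of incoherence rather than separate problems.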
Source
- Link: https://lesswrong.com/posts/nuDJNyG5XLQjtvaeg/is-alignment-reducible-to-becoming-more-coherent
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - agent-foundations — Theory