Deceptive Alignment and Homuncularity
Oliver Sourbut, TurnTrout — 2025-01-16 — UK AI Safety Institute, Independent — LessWrong
Summary
A dialogue critically examining the theoretical and evidential basis for deceptive alignment, particularly the ‘inner homunculus’ hypothesis that models develop persistent cross-context goals. It distinguishes among three forms of deception: scaffolded, within-forward-pass, and consistent-across-contexts.
Source
- Link: https://lesswrong.com/posts/9htmQx5wiePqTtZuL/deceptive-alignment-and-homuncularity
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- behavior-alignment-theory — Theory / Corrigibility