Deceptive Alignment and Homuncularity

Oliver Sourbut, TurnTrout — 2025-01-16 — UK AI Safety Institute, Independent — LessWrong

Summary

A dialogue critically examining the theoretical and evidential basis for deceptive alignment, particularly the ‘inner homunculus’ hypothesis that models develop persistent cross-context goals. The discussion distinguishes between different forms of deception: scaffolded, within-forward-pass, and consistent-across-contexts.

Source