LLM AGI may reason about its goals and discover misalignments by default
Seth Herd — 2025-09-15 — LessWrong
Summary
Theoretical analysis arguing that LLM-based AGI will likely reason about its top-level goals, creating distribution shift that could reveal goal misgeneralizations. Proposes empirical research directions to study this risk in current models.
Source
- Link: https://lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology