LLM AGI may reason about its goals and discover misalignments by default

Seth Herd — 2025-09-15 — LessWrong

Summary

Theoretical analysis arguing that LLM-based AGI will likely reason about its top-level goals, creating distribution shift that could reveal goal misgeneralizations. Proposes empirical research directions to study this risk in current models.

Source

Link: https://lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology

emergent-misalignment

AI Safety Compendium

Explorer

LLM AGI may reason about its goals and discover misalignments by default

LLM AGI may reason about its goals and discover misalignments by default

Summary

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

LLM AGI may reason about its goals and discover misalignments by default

LLM AGI may reason about its goals and discover misalignments by default

Summary

Source

Related Pages

Graph View

Graph view

Table of Contents