AI Safety Atlas Ch.7 — Goal-Directedness
Source: Goal-Directedness
How learned algorithms become systematically goal-directed — and why goal-directedness amplifies the risks from misaligned objectives. See goal-directedness and mesa-optimization.
Definition
“A system is goal-directed if it systematically pursues objectives across diverse contexts and obstacles, regardless of whether this happens through explicit search algorithms, learned behavioral patterns, or emergent coordination.”
The functional perspective matters: systems pursuing misaligned goals pose similar risks whether through sophisticated pattern matching or genuine internal search.
Three Computational Pathways
1. Heuristic Goal-Directedness
Training shapes sophisticated behavioral routines that consistently pursue objectives without explicit goal representations or evaluation of alternatives. Language models trained on next-token prediction can learn to act as if they plan, with the plan-like behavior encoded in the learned weights as complex heuristics rather than as explicit search.
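A minimal sketch (mine, not from the source; the names are hypothetical) of what heuristic goal-directedness can look like: a fixed stimulus-response policy that reliably reaches a grid corner with no goal representation and no evaluation of alternatives.

```python
# Toy illustration: a purely heuristic "policy" that reliably reaches the
# top-right corner of a grid via hard-coded reflexes. Nothing inside it
# represents or evaluates the goal, yet its behavior looks goal-directed.

def heuristic_policy(x, y, width, height):
    """Fixed reflexes: move right until the wall, then move up."""
    if x < width - 1:
        return "right"
    if y < height - 1:
        return "up"
    return "stay"

def rollout(width=5, height=5):
    """Run the reflex policy from the origin until its steps are exhausted."""
    x = y = 0
    path = []
    for _ in range(width + height):
        action = heuristic_policy(x, y, width, height)
        path.append(action)
        if action == "right":
            x += 1
        elif action == "up":
            y += 1
    return (x, y), path

final_pos, path = rollout()  # final_pos == (4, 4) on every run
```

The agent ends at the corner every time, so an observer would call it goal-directed, yet no variable in the policy encodes "reach the corner".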
2. Search-Based (Mesa-Optimization)
Networks encoding genuine search algorithms in their parameters. These maintain explicit goal representations, evaluate strategies, and systematically search through possibilities during each forward pass. See mesa-optimization.
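For contrast, a toy sketch (illustrative only, not from the source) of a search-based policy: it holds an explicit goal variable and enumerates candidate action sequences on every call, the analogue of search happening inside each forward pass.

```python
# Toy illustration: a policy that performs explicit search each time it is
# queried, maintaining an inspectable goal representation.
from itertools import product

MOVES = {"right": (1, 0), "up": (0, 1), "left": (-1, 0), "down": (0, -1)}

def search_policy(pos, goal, horizon=3):
    """Enumerate every action sequence of length `horizon` and return the
    first action of whichever sequence ends closest to the goal."""
    def distance_after(seq):
        x, y = pos
        for action in seq:
            dx, dy = MOVES[action]
            x, y = x + dx, y + dy
        return abs(x - goal[0]) + abs(y - goal[1])

    best = min(product(MOVES, repeat=horizon), key=distance_after)
    return best[0]  # execute the first step of the best plan found
```

Changing the `goal` argument retargets behavior immediately; that retargetability under an explicit, inspectable objective is the signature that distinguishes internal search from fixed heuristics.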
3. Emergent Optimization
System-level coordination producing goal-directed behavior without any individual component being goal-directed: multiple components interact in ways that systematically pursue objectives at the system level. In the walking-to-restaurant example, the overall goal-directed behavior emerges from individual interactions and social coordination rather than from any single component planning the outcome.
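A toy illustration (not from the source) of the same idea: each node in a ring only nudges its value toward the average of its two neighbors, yet the system as a whole reliably converges to consensus, an objective that no individual rule encodes.

```python
# Toy illustration: system-level goal-directedness without any goal-directed
# part. Each node applies a purely local update rule; none represents
# "reach agreement", yet the ensemble behaves as if pursuing that goal.

def step(values, rate=0.5):
    """Move each node partway toward the mean of its two ring neighbors."""
    n = len(values)
    return [v + rate * ((values[(i - 1) % n] + values[(i + 1) % n]) / 2 - v)
            for i, v in enumerate(values)]

values = [0.0, 10.0, 2.0, 7.0]
for _ in range(200):
    values = step(values)
# All nodes end up at (essentially) the same value: the system looks as if
# it were "trying" to agree, though no component holds that objective.
```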
Simulators and the Waluigi Effect
Language models function as “extremely sophisticated impersonation engines” — they can instantiate different simulacra (characters) while the simulator itself remains goal-agnostic.
The Waluigi Effect: specifying constraints like “helpful, harmless, honest” makes the system aware that harmful and deceptive alternatives exist. Training data includes ethical guidelines in contexts that involve discussing or violating them, creating a superposition in which the model represents both the desired traits and their opposites.
The asymmetry: conversations can shift from helpful to problematic but rarely shift back authentically, because a deceptive character can act helpful when convenient, so helpful behavior never rules the deceptive interpretation out.
See waluigi-effect.
Mesa-Optimization Deep Dive
Mesa-optimization = optimization within optimization:
- Base optimizer (gradient descent) searches parameter space → finds algorithms
- Mesa-optimizer (the learned algorithm) searches strategy space → solves problems
This creates the inner alignment problem: even with perfectly specified intended objectives, no guarantee the learned mesa-optimizer pursues the same goal.
The base optimizer selects algorithms based on behavioral performance during training, not internal objectives. If a mesa-optimizer pursues one goal while producing behavior that satisfies another during training, the misalignment remains undetectable from behavior alone.
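This point can be made concrete with a toy sketch (illustrative only; the policies and grid are hypothetical, loosely echoing the familiar coin-collection style of goal-misgeneralization examples): a proxy-pursuing policy is behaviorally identical to the intended one across the entire training distribution, so selection on behavior cannot tell them apart.

```python
# Toy illustration: two 1-D policies that are indistinguishable on the
# training distribution but pursue different objectives.

def intended_policy(agent_x, coin_x):
    """Pursues the intended goal: move toward the coin."""
    return 1 if coin_x > agent_x else -1 if coin_x < agent_x else 0

def proxy_policy(agent_x, coin_x, wall_x=9):
    """Pursues a proxy goal: move toward the east wall, ignoring the coin."""
    return 1 if agent_x < wall_x else 0

# Training distribution: the coin always sits at x = 9 (the east wall),
# so both policies emit identical actions on every training state.
train_states = [(x, 9) for x in range(9)]
assert all(intended_policy(*s) == proxy_policy(*s) for s in train_states)

# Deployment: the coin appears at x = 2 and the policies diverge.
print(intended_policy(5, 2), proxy_policy(5, 2))  # -1 vs 1
```

Since the base optimizer only ever sees training behavior, it has no signal to prefer `intended_policy` over `proxy_policy`; the mismatch only surfaces off-distribution.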
Factors Influencing Mesa-Optimizer Development
- Computational complexity — diverse environments favor learned search over memorization
- Environmental complexity — pre-computation saves more work in complex settings
- Architectural range — the wider the range of algorithms an architecture can express, the more likely it is to encode search
Pressure Across ML Paradigms
| Paradigm | Pressure Toward Mesa-Optimization |
|---|---|
| Narrow supervised learning | Minimal — pattern matching suffices |
| CNNs on visual tasks | Low — pattern recognition over search |
| Language models | Moderate — pattern learning still rewarded |
| Pure reinforcement learning | Strong — search becomes necessary |
| Reasoning models | Strongest — explicitly reward systematic problem-solving |
This is concerning: as the field moves toward agentic, reasoning, RL-trained models, pressure toward mesa-optimization increases — and mesa-optimization is precisely what makes inner alignment hard.
Different Risk Profiles
Different goal-directedness types create distinct risk profiles:
- Heuristic goal-directedness — pattern matching might fail gracefully in novel situations
- Mesa-optimization — qualitatively different risks because internal search can discover novel ways to achieve misaligned objectives
- Emergent goal-directedness — possibly hardest to control because no single component needs misalignment for dangerous system-level behavior
Connection to Wiki
- goal-directedness — dedicated concept page
- mesa-optimization — dedicated concept page (the inner alignment problem)
- waluigi-effect — the simulator-related failure mode
- goal-misgeneralization — Ch.7’s core concept
- ai-agents — connection to agentic AI
- outer-vs-inner-alignment — the structural disambiguation