AI Safety Atlas Ch.7 — Goal-Directedness
Source: Goal-Directedness
How learned algorithms become systematically goal-directed — and why goal-directedness amplifies the risks from misaligned objectives. See goal-directedness and mesa-optimization.
Definition
“A system is goal-directed if it systematically pursues objectives across diverse contexts and obstacles, regardless of whether this happens through explicit search algorithms, learned behavioral patterns, or emergent coordination.”
The functional perspective matters: systems pursuing misaligned goals pose similar risks whether through sophisticated pattern matching or genuine internal search.
Three Computational Pathways
1. Heuristic Goal-Directedness
Training shapes sophisticated behavioral routines that consistently pursue objectives without explicit goal representations or evaluation of alternatives. Language models trained on next-token prediction can learn to act as if they plan, with the plan-like behavior encoded in the learned weights as complex heuristics rather than as explicit search.
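A minimal sketch (mine, not from the source; the names are hypothetical) of what heuristic goal-directedness can look like: a fixed stimulus-response policy that reliably reaches a grid corner with no goal representation and no evaluation of alternatives.

```python
# Toy illustration: a purely heuristic "policy" that reliably reaches the
# top-right corner of a grid via hard-coded reflexes. Nothing inside it
# represents or evaluates the goal, yet its behavior looks goal-directed.

def heuristic_policy(x, y, width, height):
    """Fixed reflexes: move right until the wall, then move up."""
    if x < width - 1:
        return "right"
    if y < height - 1:
        return "up"
    return "stay"

def rollout(width=5, height=5):
    """Run the reflex policy from the origin until its steps are exhausted."""
    x = y = 0
    path = []
    for _ in range(width + height):
        action = heuristic_policy(x, y, width, height)
        path.append(action)
        if action == "right":
            x += 1
        elif action == "up":
            y += 1
    return (x, y), path

final_pos, path = rollout()  # final_pos == (4, 4) on every run
```

The agent ends at the corner every time, so an observer would call it goal-directed, yet no variable in the policy encodes "reach the corner".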
2. Search-Based (Mesa-Optimization)
Networks encoding genuine search algorithms in their parameters. These maintain explicit goal representations, evaluate strategies, and systematically search through possibilities during each forward pass. See mesa-optimization.
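For contrast, a toy sketch (illustrative only, not from the source) of a search-based policy: it holds an explicit goal variable and enumerates candidate action sequences on every call, the analogue of search happening inside each forward pass.

```python
# Toy illustration: a policy that performs explicit search each time it is
# queried, maintaining an inspectable goal representation.
from itertools import product

MOVES = {"right": (1, 0), "up": (0, 1), "left": (-1, 0), "down": (0, -1)}

def search_policy(pos, goal, horizon=3):
    """Enumerate every action sequence of length `horizon` and return the
    first action of whichever sequence ends closest to the goal."""
    def distance_after(seq):
        x, y = pos
        for action in seq:
            dx, dy = MOVES[action]
            x, y = x + dx, y + dy
        return abs(x - goal[0]) + abs(y - goal[1])

    best = min(product(MOVES, repeat=horizon), key=distance_after)
    return best[0]  # execute the first step of the best plan found
```

Changing the `goal` argument retargets behavior immediately; that retargetability under an explicit, inspectable objective is the signature that distinguishes internal search from fixed heuristics.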
3. Emergent Optimization
System-level coordination producing goal-directed behavior without any individual component being goal-directed: multiple components interact in ways that systematically pursue objectives at the system level. In the walking-to-restaurant example, the overall goal-directed behavior emerges from individual interactions and social coordination rather than from any single component planning the outcome.
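A toy illustration (not from the source) of the same idea: each node in a ring only nudges its value toward the average of its two neighbors, yet the system as a whole reliably converges to consensus, an objective that no individual rule encodes.

```python
# Toy illustration: system-level goal-directedness without any goal-directed
# part. Each node applies a purely local update rule; none represents
# "reach agreement", yet the ensemble behaves as if pursuing that goal.

def step(values, rate=0.5):
    """Move each node partway toward the mean of its two ring neighbors."""
    n = len(values)
    return [v + rate * ((values[(i - 1) % n] + values[(i + 1) % n]) / 2 - v)
            for i, v in enumerate(values)]

values = [0.0, 10.0, 2.0, 7.0]
for _ in range(200):
    values = step(values)
# All nodes end up at (essentially) the same value: the system looks as if
# it were "trying" to agree, though no component holds that objective.
```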
Simulators and the Waluigi Effect
Language models function as “extremely sophisticated impersonation engines” — they can instantiate different simulacra (characters) while the simulator itself remains goal-agnostic.
The Waluigi Effect: specifying constraints like “helpful, harmless, honest” makes the system aware that harmful and deceptive alternatives exist. Training data includes ethical guidelines in contexts that involve discussing or violating them, creating a superposition in which the model represents both the desired traits and their opposites.
The asymmetry: conversations can shift from helpful to problematic but rarely shift back authentically, because a deceptive character can act helpful when convenient, so helpful behavior never rules the deceptive interpretation out.
See waluigi-effect.
Mesa-Optimization Deep Dive
Mesa-optimization = optimization within optimization:
- Base optimizer (gradient descent) searches parameter space → finds algorithms
- Mesa-optimizer (the learned algorithm) searches strategy space → solves problems
This creates the inner alignment problem: even with perfectly specified intended objectives, no guarantee the learned mesa-optimizer pursues the same goal.
The base optimizer selects algorithms based on behavioral performance during training, not internal objectives. If a mesa-optimizer pursues one goal while producing behavior that satisfies another during training, the misalignment remains undetectable from behavior alone.
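This point can be made concrete with a toy sketch (illustrative only; the policies and grid are hypothetical, loosely echoing the familiar coin-collection style of goal-misgeneralization examples): a proxy-pursuing policy is behaviorally identical to the intended one across the entire training distribution, so selection on behavior cannot tell them apart.

```python
# Toy illustration: two 1-D policies that are indistinguishable on the
# training distribution but pursue different objectives.

def intended_policy(agent_x, coin_x):
    """Pursues the intended goal: move toward the coin."""
    return 1 if coin_x > agent_x else -1 if coin_x < agent_x else 0

def proxy_policy(agent_x, coin_x, wall_x=9):
    """Pursues a proxy goal: move toward the east wall, ignoring the coin."""
    return 1 if agent_x < wall_x else 0

# Training distribution: the coin always sits at x = 9 (the east wall),
# so both policies emit identical actions on every training state.
train_states = [(x, 9) for x in range(9)]
assert all(intended_policy(*s) == proxy_policy(*s) for s in train_states)

# Deployment: the coin appears at x = 2 and the policies diverge.
print(intended_policy(5, 2), proxy_policy(5, 2))  # -1 vs 1
```

Since the base optimizer only ever sees training behavior, it has no signal to prefer `intended_policy` over `proxy_policy`; the mismatch only surfaces off-distribution.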
Factors Influencing Mesa-Optimizer Development
- Computational complexity — diverse environments favor learned search over memorization
- Environmental complexity — pre-computation saves more work in complex settings
- Architectural range — the wider the range of algorithms an architecture can express, the more likely it is to encode search
Pressure Across ML Paradigms
| Paradigm | Pressure Toward Mesa-Optimization |
|---|---|
| Narrow supervised learning | Minimal — pattern matching suffices |
| CNNs on visual tasks | Low — pattern recognition over search |
| Language models | Moderate — pattern learning still rewarded |
| Pure reinforcement learning | Strong — search becomes necessary |
| Reasoning models | Strongest — explicitly reward systematic problem-solving |
This is concerning: as the field moves toward agentic, reasoning, RL-trained models, pressure toward mesa-optimization increases — and mesa-optimization is precisely what makes inner alignment hard.
Different Risk Profiles
Different goal-directedness types create distinct risk profiles:
- Heuristic goal-directedness — pattern matching might fail gracefully in novel situations
- Mesa-optimization — qualitatively different risks because internal search can discover novel ways to achieve misaligned objectives
- Emergent goal-directedness — possibly hardest to control because no single component needs misalignment for dangerous system-level behavior
Connection to Wiki
- goal-directedness — dedicated concept page
- mesa-optimization — dedicated concept page (the inner alignment problem)
- waluigi-effect — the simulator-related failure mode
- goal-misgeneralization — Ch.7’s core concept
- ai-agents — connection to agentic AI
- outer-vs-inner-alignment — the structural disambiguation