Risk Models
Failure-mode taxonomy and threat-model analysis for advanced AI: deceptive-alignment, scheming, mesa-optimization, goal-misgeneralization, specification-gaming, reward-hacking, goodharts-law, instrumental-convergence, power-seeking, ai-takeover-scenarios, and the meta-arguments in ai-risk-arguments and existential-risk. This cluster supplies the substantive content that the rest of the safety field's interventions are designed to address.
This pillar is where the field's central tension lives: the gap between behavioral evaluation (what current alignment methods rely on) and the most concerning failure modes (which are precisely those that defeat behavioral evaluation). outer-vs-inner-alignment is the framing that locates this gap; ai-control is the operational response when the gap cannot be closed.