Risk Models
Failure-mode taxonomy and threat-model analysis for advanced AI: deceptive-alignment, scheming, mesa-optimization, goal-misgeneralization, specification-gaming, reward-hacking, goodharts-law, instrumental-convergence, power-seeking, ai-takeover-scenarios, and the meta-arguments in ai-risk-arguments and existential-risk. This cluster supplies the substantive content that the rest of the safety field's interventions are designed to address.
This pillar is where the field's central tension lives: the gap between behavioral evaluation (what current alignment methods rely on) and the most concerning failure modes (which are precisely those that defeat behavioral evaluation). outer-vs-inner-alignment is the framing that locates this gap; ai-control is the operational response when the gap cannot be closed.