AI Safety Atlas Textbook
The AI Safety Atlas (ai-safety-atlas.com) is an 8-chapter self-paced textbook (~15 hours) covering the foundations of AI safety from core concepts through cutting-edge research. Authored by Markov Grey and Charbel-Raphaël Ségerie (Executive Director of CeSIA), it is freely available and used as the curriculum spine for CeSIA’s university-accredited AI safety course at École Normale Supérieure.
Why It Matters in This Wiki
This wiki ingests the textbook chapter-by-chapter as a curated curriculum scaffold over the field. Once compiled, each subchapter sits as a node in the wiki’s cross-linked graph rather than as a step in a sequential read — the chapter on capabilities connects to scaling-laws, transformative-ai, and foundation-models; the chapter on goal misgeneralization connects to deceptive-alignment and the SR2025 scheming / deception evals agendas; and so on. See compiled-wiki-vs-rag for why this transformation matters.
Name Collision
This textbook shares its name with the wiki project’s planned AI Safety Atlas — a continuously updated living literature map (publishing at aiforhumanity.eu). The two are distinct: the textbook is a curated pedagogical resource frozen at publication; the wiki project is an evolving graph that compounds with each new ingested source. Differentiation may eventually warrant renaming the wiki project.
Chapter Index
Chapter 1 — Capabilities
- atlas-ch1-capabilities-00-introduction — chapter overview and structure
- atlas-ch1-capabilities-01-defining-and-measuring-agi — capability×generality framework, ANI/AGI/TAI/ASI thresholds, autonomy levels
- atlas-ch1-capabilities-02-foundation-models — paradigm shift to general-purpose pre-trained systems
- atlas-ch1-capabilities-03-leveraging-scale — bitter lesson, scaling laws, scaling hypotheses
- atlas-ch1-capabilities-04-current-capabilities — late-2025 capability survey: games, text, tools, reasoning, code, vision, robotics
- atlas-ch1-capabilities-05-forecasting-timelines — effective compute, data exhaustion, biological anchors
- atlas-ch1-capabilities-06-takeoff — speed/continuity/homogeneity/polarity framework
- atlas-ch1-capabilities-07-appendix-discussion-on-llms — answers to common LLM critiques (creativity, hallucination, world models, System 2)
- atlas-ch1-capabilities-08-appendix-expert-surveys — expert quotes and AI Impacts surveys
- atlas-ch1-capabilities-09-appendix-forecasting — detailed quantitative forecasts (chips, power, costs)
- atlas-ch1-capabilities-10-appendix-takeoff — continuity, homogeneity, polarity in detail
Chapter 2 — Risks
- atlas-ch2-risks-00-introduction — two-dimensional framework: cause × severity, with risk amplifiers
- atlas-ch2-risks-01-risk-decomposition — misuse / misalignment / systemic, individual / catastrophic / existential, plus i-risks and s-risks
- atlas-ch2-risks-02-dangerous-capabilities — five risk-relevant capabilities: deception, situational awareness, power-seeking, autonomous replication, agency
- atlas-ch2-risks-03-risk-amplifiers — race dynamics, accidents, indifference, collective-action problems, unpredictability
- atlas-ch2-risks-04-misuse-risks — bio, cyber, autonomous weapons, adversarial AI
- atlas-ch2-risks-05-misalignment-risks — specification gaming, treacherous turns, recursive self-improvement
- atlas-ch2-risks-06-systemic-risks — agent-agnostic emergent risks: epistemic erosion, power concentration, mass unemployment, value lock-in, enfeeblement
- atlas-ch2-risks-07-conclusion — existential hope and call to action
- atlas-ch2-risks-08-appendix-forecasting-scenarios — Production Web and AI 2027 narratives
- atlas-ch2-risks-09-appendix-quantifying-existential-risks — P(doom) metric and expert estimates
Chapter 3 — Strategies
- atlas-ch3-strategies-00-introduction — three strategy families + defense-in-depth philosophy
- atlas-ch3-strategies-01-definitions — disambiguating safety / alignment / ethics / control
- atlas-ch3-strategies-02-challenges — pre-paradigmatic field, structural obstacles, safety washing
- atlas-ch3-strategies-03-misuse-prevention-strategies — access controls, WSL/SSL security, circuit breakers, machine unlearning
- atlas-ch3-strategies-04-agi-safety-strategies — alignment + control + iterative improvement + transparent thoughts; “the most forbidden technique”
- atlas-ch3-strategies-05-asi-safety-strategies — automate alignment research, safety-by-design, world coordination, MAIM, pivotal acts
- atlas-ch3-strategies-06-socio-technical-strategies — defense-in-depth, d/acc, governance, risk management, safety culture
- atlas-ch3-strategies-07-combining-strategies — four-step strategic sequence integrating all approaches
- atlas-ch3-strategies-08-conclusion — three persistent tensions: centralization, speed, openness
- atlas-ch3-strategies-09-appendix-long-term-questions — flourishing vs. survival, CEV/CAV/CBV, alignment-to-whom
Chapter 4 — Governance
- atlas-ch4-governance-00-introduction — frontier AI scope, Bletchley framing
- atlas-ch4-governance-01-governance-problems — three problems (emergence, deployment, proliferation) + three target criteria
- atlas-ch4-governance-02-compute-governance — chip supply chain, monitoring, on-chip controls, KYC
- atlas-ch4-governance-03-systemic-challenges — race dynamics, proliferation, uncertainty, accountability, power concentration
- atlas-ch4-governance-04-governance-architectures — corporate / national / international three-level model
- atlas-ch4-governance-05-implementation — standards, ASPIRE visibility framework, licensing, six fundamental limitations
- atlas-ch4-governance-06-conclusion — six capacity-building areas; the urgency argument
- atlas-ch4-governance-07-appendix-data-governance — data as secondary governance target
- atlas-ch4-governance-08-appendix-national-governance — EU/US/China comparative analysis
Chapter 5 — Evaluations
- atlas-ch5-evaluations-00-introduction — three-part safety assessment + evaluation gap problem
- atlas-ch5-evaluations-01-evaluation-design — affordances, MWEs, audits
- atlas-ch5-evaluations-02-evaluation-techniques — behavioral + internal techniques
- atlas-ch5-evaluations-03-evaluated-properties — capability/propensity/control three-way decomposition
- atlas-ch5-evaluations-04-dangerous-capability-evaluations — cyber, deception, replication, planning, situational awareness
- atlas-ch5-evaluations-05-dangerous-propensity-evaluations — comparative-choice methods, deception family, scheming
- atlas-ch5-evaluations-06-control-evaluations — adversarial red/blue team protocols
- atlas-ch5-evaluations-07-evaluation-frameworks — RSP/PF/FSF + model organisms framework
- atlas-ch5-evaluations-08-benchmarks — historical evolution + saturation problem
- atlas-ch5-evaluations-09-limitations — fundamental, technical, sandbagging, systemic, governance
- atlas-ch5-evaluations-10-conclusion — three advancement directions
Chapter 6 — Specification Gaming
- atlas-ch6-specification-gaming-00-introduction — chapter overview, six sections
- atlas-ch6-specification-gaming-01-reinforcement-learning — RL foundations: rewards, policies, value functions
- atlas-ch6-specification-gaming-02-optimization — Goodhart’s Law and optimization-pressure phase transitions
- atlas-ch6-specification-gaming-03-specification-gaming — reward misspecification, hacking, tampering, wireheading
- atlas-ch6-specification-gaming-04-learning-from-feedback — RLHF, RLAIF/Constitutional AI, DPO; outer vs. inner alignment
- atlas-ch6-specification-gaming-05-learning-from-imitation — BC, PC, IRL, CIRL; goal inference problem
Chapter 7 — Goal Misgeneralization
- atlas-ch7-goal-misgeneralization-00-introduction — chapter overview; “most counterintuitive problem in AI safety”
- atlas-ch7-goal-misgeneralization-01-learning-dynamics — loss landscapes, path dependence, simplicity bias
- atlas-ch7-goal-misgeneralization-02-goal-directedness — heuristic, mesa-optimization, emergent forms; Waluigi Effect
- atlas-ch7-goal-misgeneralization-03-multi-objective-generalization — CoinRun example; orthogonality empirical support
- atlas-ch7-goal-misgeneralization-04-scheming — three archetypes; four prerequisites; empirical evidence
- atlas-ch7-goal-misgeneralization-05-detection — behavioral + internal techniques; defense in depth
- atlas-ch7-goal-misgeneralization-06-mitigations — training/post-training/deployment-time mitigations; Singular Learning Theory
Chapter 8 — Scalable Oversight
- atlas-ch8-scalable-oversight-00-introduction — six core techniques overview
- atlas-ch8-scalable-oversight-01-oversight — verification vs. generation foundational principle
- atlas-ch8-scalable-oversight-02-task-decomposition — recursive decomposability, factored cognition
- atlas-ch8-scalable-oversight-03-process-oversight — ERO, procedural cloning, scalability challenge
- atlas-ch8-scalable-oversight-04-iterated-amplification — IDA five-step process, three amplification methods
- atlas-ch8-scalable-oversight-05-debate — DCG, five failure modes for “truth prevails”
- atlas-ch8-scalable-oversight-06-weak-to-strong-w2s — narrowly superhuman test cases, sandwiching, PGR
Related Pages
- markov-grey
- charbel-raphael-segerie
- cesia
- ai-safety
- read-the-textbook-ai-safety-atlas
- compiled-wiki-vs-rag
- knowledge-compounding