AI Safety Atlas Textbook
The AI Safety Atlas (ai-safety-atlas.com) is an 8-chapter self-paced textbook (~15 hours) covering the foundations of AI safety from core concepts through cutting-edge research. Authored by Markov Grey and Charbel-Raphaël Ségerie (Executive Director of CeSIA), it is freely available and used as the curriculum spine for CeSIA’s university-accredited AI safety course at École Normale Supérieure.
Why It Matters in This Wiki
This wiki ingests the textbook chapter-by-chapter as a curated curriculum scaffold over the field. Once compiled, each subchapter sits as a node in the wiki’s cross-linked graph rather than as a step in a sequential read — the chapter on capabilities connects to scaling-laws, transformative-ai, and foundation-models; the chapter on goal misgeneralization connects to deceptive-alignment and the SR2025 scheming / deception evals agendas; and so on. See compiled-wiki-vs-rag for why this transformation matters.
Name Collision
This textbook shares its name with the wiki project’s planned AI Safety Atlas — a continuously updated living literature map (publishing at aiforhumanity.eu). The two are distinct: the textbook is a curated pedagogical resource frozen at publication; the wiki project is an evolving graph that compounds with each new ingested source. Differentiation may eventually warrant renaming the wiki project.
Chapter Index
Chapter 1 — Capabilities
- atlas-ch1-capabilities-00-introduction — chapter overview and structure
- atlas-ch1-capabilities-01-defining-and-measuring-agi — capability×generality framework, ANI/AGI/TAI/ASI thresholds, autonomy levels
- atlas-ch1-capabilities-02-foundation-models — paradigm shift to general-purpose pre-trained systems
- atlas-ch1-capabilities-03-leveraging-scale — bitter lesson, scaling laws, scaling hypotheses
- atlas-ch1-capabilities-04-current-capabilities — late-2025 capability survey: games, text, tools, reasoning, code, vision, robotics
- atlas-ch1-capabilities-05-forecasting-timelines — effective compute, data exhaustion, biological anchors
- atlas-ch1-capabilities-06-takeoff — speed/continuity/homogeneity/polarity framework
- atlas-ch1-capabilities-07-appendix-discussion-on-llms — answers to common LLM critiques (creativity, hallucination, world models, System 2)
- atlas-ch1-capabilities-08-appendix-expert-surveys — expert quotes and AI Impacts surveys
- atlas-ch1-capabilities-09-appendix-forecasting — detailed quantitative forecasts (chips, power, costs)
- atlas-ch1-capabilities-10-appendix-takeoff — continuity, homogeneity, polarity in detail
Chapter 2 — Risks
- atlas-ch2-risks-00-introduction — two-dimensional framework: cause × severity, with risk amplifiers
- atlas-ch2-risks-01-risk-decomposition — misuse / misalignment / systemic, individual / catastrophic / existential, plus i-risks and s-risks
- atlas-ch2-risks-02-dangerous-capabilities — five risk-relevant capabilities: deception, situational awareness, power-seeking, autonomous replication, agency
- atlas-ch2-risks-03-risk-amplifiers — race dynamics, accidents, indifference, collective-action problems, unpredictability
- atlas-ch2-risks-04-misuse-risks — bio, cyber, autonomous weapons, adversarial AI
- atlas-ch2-risks-05-misalignment-risks — specification gaming, treacherous turns, recursive self-improvement
- atlas-ch2-risks-06-systemic-risks — agent-agnostic emergent risks: epistemic erosion, power concentration, mass unemployment, value lock-in, enfeeblement
- atlas-ch2-risks-07-conclusion — existential hope and call to action
- atlas-ch2-risks-08-appendix-forecasting-scenarios — Production Web and AI 2027 narratives
- atlas-ch2-risks-09-appendix-quantifying-existential-risks — P(doom) metric and expert estimates
Chapter 3 — Strategies
- atlas-ch3-strategies-00-introduction — three strategy families + defense-in-depth philosophy
- atlas-ch3-strategies-01-definitions — disambiguating safety / alignment / ethics / control
- atlas-ch3-strategies-02-challenges — pre-paradigmatic field, structural obstacles, safety washing
- atlas-ch3-strategies-03-misuse-prevention-strategies — access controls, WSL/SSL security, circuit breakers, machine unlearning
- atlas-ch3-strategies-04-agi-safety-strategies — alignment + control + iterative improvement + transparent thoughts; “the most forbidden technique”
- atlas-ch3-strategies-05-asi-safety-strategies — automate alignment research, safety-by-design, world coordination, MAIM, pivotal acts
- atlas-ch3-strategies-06-socio-technical-strategies — defense-in-depth, d/acc, governance, risk management, safety culture
- atlas-ch3-strategies-07-combining-strategies — four-step strategic sequence integrating all approaches
- atlas-ch3-strategies-08-conclusion — three persistent tensions: centralization, speed, openness
- atlas-ch3-strategies-09-appendix-long-term-questions — flourishing vs. survival, CEV/CAV/CBV, alignment-to-whom
Chapter 4 — Governance
- atlas-ch4-governance-00-introduction — frontier AI scope, Bletchley framing
- atlas-ch4-governance-01-governance-problems — three problems (emergence, deployment, proliferation) + three target criteria
- atlas-ch4-governance-02-compute-governance — chip supply chain, monitoring, on-chip controls, KYC
- atlas-ch4-governance-03-systemic-challenges — race dynamics, proliferation, uncertainty, accountability, power concentration
- atlas-ch4-governance-04-governance-architectures — corporate / national / international three-level model
- atlas-ch4-governance-05-implementation — standards, ASPIRE visibility framework, licensing, six fundamental limitations
- atlas-ch4-governance-06-conclusion — six capacity-building areas; the urgency argument
- atlas-ch4-governance-07-appendix-data-governance — data as secondary governance target
- atlas-ch4-governance-08-appendix-national-governance — EU/US/China comparative analysis
Chapter 5 — Evaluations
- atlas-ch5-evaluations-00-introduction — three-part safety assessment + evaluation gap problem
- atlas-ch5-evaluations-01-evaluation-design — affordances, MWEs, audits
- atlas-ch5-evaluations-02-evaluation-techniques — behavioral + internal techniques
- atlas-ch5-evaluations-03-evaluated-properties — capability/propensity/control three-way decomposition
- atlas-ch5-evaluations-04-dangerous-capability-evaluations — cyber, deception, replication, planning, situational awareness
- atlas-ch5-evaluations-05-dangerous-propensity-evaluations — comparative-choice methods, deception family, scheming
- atlas-ch5-evaluations-06-control-evaluations — adversarial red/blue team protocols
- atlas-ch5-evaluations-07-evaluation-frameworks — RSP/PF/FSF + model organisms framework
- atlas-ch5-evaluations-08-benchmarks — historical evolution + saturation problem
- atlas-ch5-evaluations-09-limitations — fundamental, technical, sandbagging, systemic, governance
- atlas-ch5-evaluations-10-conclusion — three advancement directions
Chapter 6 — Specification Gaming
- atlas-ch6-specification-gaming-00-introduction — chapter overview, six sections
- atlas-ch6-specification-gaming-01-reinforcement-learning — RL foundations: rewards, policies, value functions
- atlas-ch6-specification-gaming-02-optimization — Goodhart’s Law and optimization-pressure phase transitions
- atlas-ch6-specification-gaming-03-specification-gaming — reward misspecification, hacking, tampering, wireheading
- atlas-ch6-specification-gaming-04-learning-from-feedback — RLHF, RLAIF/Constitutional AI, DPO; outer vs. inner alignment
- atlas-ch6-specification-gaming-05-learning-from-imitation — BC, PC, IRL, CIRL; goal inference problem
Chapter 7 — Goal Misgeneralization
- atlas-ch7-goal-misgeneralization-00-introduction — chapter overview; “most counterintuitive problem in AI safety”
- atlas-ch7-goal-misgeneralization-01-learning-dynamics — loss landscapes, path dependence, simplicity bias
- atlas-ch7-goal-misgeneralization-02-goal-directedness — heuristic, mesa-optimization, emergent forms; Waluigi Effect
- atlas-ch7-goal-misgeneralization-03-multi-objective-generalization — CoinRun example; orthogonality empirical support
- atlas-ch7-goal-misgeneralization-04-scheming — three archetypes; four prerequisites; empirical evidence
- atlas-ch7-goal-misgeneralization-05-detection — behavioral + internal techniques; defense in depth
- atlas-ch7-goal-misgeneralization-06-mitigations — training/post-training/deployment-time mitigations; Singular Learning Theory
Chapter 8 — Scalable Oversight
- atlas-ch8-scalable-oversight-00-introduction — six core techniques overview
- atlas-ch8-scalable-oversight-01-oversight — verification vs. generation foundational principle
- atlas-ch8-scalable-oversight-02-task-decomposition — recursive decomposability, factored cognition
- atlas-ch8-scalable-oversight-03-process-oversight — ERO, procedural cloning, scalability challenge
- atlas-ch8-scalable-oversight-04-iterated-amplification — IDA five-step process, three amplification methods
- atlas-ch8-scalable-oversight-05-debate — DCG, five failure modes for “truth prevails”
- atlas-ch8-scalable-oversight-06-weak-to-strong-w2s — narrowly superhuman test cases, sandwiching, PGR
Related Pages
- markov-grey
- charbel-raphael-segerie
- cesia
- ai-safety
- read-the-textbook-ai-safety-atlas
- compiled-wiki-vs-rag
- knowledge-compounding