AI Safety Compendium
Proof of concept — not yet open for public collaboration. This site is an early-stage, single-maintainer prototype of a living, cross-referenced map of AI safety. The source repository is currently private; the rendered site is reachable, but the workflow, scope, and editorial conventions are still being shaken down. Feedback is welcome by email — see Connect & contribute below.
The living, cross-referenced map of AI safety — technical research, governance and policy, risk analysis, and field-shaping commentary. Updated weekly. Every claim cited.
Start at overview for the field-level picture, or jump to a section below.
How the compendium is organised
Four page types compile the literature (full methodology):
- Concepts (/concepts): single ideas like deceptive alignment, mechanistic interpretability, scalable oversight. Each page collects what the field knows: definition, why it matters, key results (each cited), open questions, the agendas working on it.
- Agendas (/agendas): research programs and approaches. Each lists lead orgs and people, current state, recent papers, historical foundations, and open problems.
- Summaries (/summaries): one page per ingested source (paper, post, blog). Source metadata, TL;DR, key claims, methods, limitations, and how the source updates the compendium’s concepts and agendas.
- Entities (/entities): thin routing pages for orgs and researchers, so cross-references resolve.
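As a rough illustration of how the four page types map onto the site's sections, the sketch below routes a page type and slug to a path. Only the four prefixes come from the list above; the `prefix/slug` layout, the `PAGE_PREFIXES` name, and the `page_path` helper are assumptions made for the example, not a description of how the site is actually built.

```python
# Path prefixes taken from the page-type list above.
PAGE_PREFIXES = {
    "concept": "/concepts",
    "agenda": "/agendas",
    "summary": "/summaries",
    "entity": "/entities",
}


def page_path(page_type: str, slug: str) -> str:
    """Build a site path for a page.

    Assumption: pages live at <prefix>/<slug>. The real URL scheme may differ.
    """
    if page_type not in PAGE_PREFIXES:
        raise ValueError(f"unknown page type: {page_type!r}")
    return f"{PAGE_PREFIXES[page_type]}/{slug}"


if __name__ == "__main__":
    print(page_path("concept", "deceptive-alignment"))
    print(page_path("agenda", "debate"))
```

Keeping the prefixes in one table means a cross-reference can be checked by looking up its page type first, which is the role the thin entity pages play for orgs and researchers.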
Where to start
The 30 most foundational concepts and 8 most active research agendas are linked below. The full catalog is bigger; these are the load-bearing pages a reader new to the field should anchor on.
Failure modes (concepts)
- deceptive-alignment — strategic alignment-faking during training.
- scheming — synonymous with deceptive alignment; Carlsmith’s 4-prerequisite decomposition.
- mesa-optimization — the structural mechanism that makes inner-alignment failure plausible.
- goal-misgeneralization — agents that learn the wrong goal even with correct training signal.
- specification-gaming — outer-alignment failure; “doing what we asked but not what we meant.”
- reward-hacking — proxy-reward exploitation; phase transitions under optimization pressure.
- goodharts-law — the structural reason specification is hard.
- outer-vs-inner-alignment — the foundational decomposition.
- situational-awareness — the capability that enables the most concerning failure modes.
Structural arguments (concepts)
- instrumental-convergence — convergent sub-goals across diverse terminal goals.
- power-seeking — instrumental convergence’s safety-relevant cluster.
- ai-takeover-scenarios — concrete pathways from misalignment to catastrophe.
- existential-risk — the irrecoverable risk class.
- takeoff-dynamics — the four-dimensional structure that determines response time.
- ai-risk-arguments — the meta-level argument structure and its critiques.
Interventions (concepts)
- ai-alignment — the parent technical problem.
- ai-control — protocol-level safety even if alignment fails.
- interpretability — verification beyond behavioral evaluation.
- mechanistic-interpretability — circuits, features, attribution graphs.
- scalable-oversight — supervising past human-capability ceiling.
- rlhf — current dominant alignment technique and its known limits.
- constitutional-ai — RLAIF; the first scalable alternative to pure RLHF.
- iterative-amplification — Christiano’s recursive bootstrapping proposal.
- superalignment — the program for systems beyond human evaluation.
- capability-evaluations — measure when models cross risk thresholds.
- responsible-scaling-policy — frontier-lab if-then commitments.
- model-organisms-of-misalignment — deliberately-trained testbeds for detection methods.
- dangerous-capabilities — the five canonical risk-relevant capability families.
- ai-governance — corporate / national / international coordination.
- ai-safety — the umbrella field.
Active research agendas
- control — assume worst-case misalignment; bound consequences via deployment protocols. (Redwood Research; ~22 papers/year.)
- chain-of-thought-monitoring — read reasoning traces for evidence of misbehavior. (OpenAI, Anthropic, DeepMind, Apollo; ~17 papers/year.)
- reverse-engineering — circuits, SAEs, attribution graphs. (Anthropic, DeepMind; ~33 papers/year.)
- capability-evals — empirical thresholds for frontier-safety frameworks. (METR, AISIs; ~34 papers/year.)
- iterative-alignment-at-post-train-time — RLHF / DPO / Constitutional AI productionization. (Most of the industry.)
- character-training-and-persona-steering — shape the model’s effective character above the raw RLHF layer. (Anthropic, OpenAI.)
- debate — adversarial scalable oversight. (DeepMind, OpenAI.)
- weak-to-strong-generalization — empirical methodology for studying superhuman alignment now. (OpenAI Superalignment lineage.)
Editorial standards
- See about for who builds the Compendium and why.
- See editorial-policy for the citation rule, contradiction handling, and corrections process.
- See methodology for the full operational detail — weekly cadence, LLM-assistance split, lint discipline, and how the audit trail works while the repo is private.
Connect & contribute
The Compendium is currently a proof of concept maintained by one person. Public collaboration via pull requests, issues, or the GitHub repo is not open at this stage. The channels that work today:
- Suggest a source: suggest sends the URL to the maintainer’s private review queue.
- Connect from your AI tool: connect documents the hosted MCP server (mcp.aiforhumanity.eu) and zero-config web endpoints.
- Send feedback or a correction: email kevin@itforhumanity.be with the page URL and what should change. Substantive corrections are logged on the affected page.
Cadence
The weekly sweep runs Monday 06:00 UTC. Approved candidates from each week’s review are compiled into summary pages and linked from the concepts and agendas they touch. The maintainer keeps a chronological operation log internally; while the repo is private it is shared on request.
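For readers who want to know when the next sweep is due, the schedule above can be computed as in the minimal sketch below. The function name `next_sweep` and the constants are illustrative assumptions. Only the Monday 06:00 UTC slot comes from the text, and nothing here reflects how the maintainer's scheduler is actually implemented.

```python
from datetime import datetime, timedelta, timezone

SWEEP_WEEKDAY = 0   # Monday, since datetime.weekday() counts Monday as 0
SWEEP_HOUR_UTC = 6  # 06:00 UTC, as stated in the Cadence section


def next_sweep(now: datetime) -> datetime:
    """Return the first Monday 06:00 UTC strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC before comparing.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    days_ahead = (SWEEP_WEEKDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=SWEEP_HOUR_UTC, minute=0, second=0, microsecond=0
    )
    # If this week's slot has already passed (or is exactly now), roll forward.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_sweep(datetime.now(timezone.utc)).isoformat())
```

A source suggested shortly after a sweep would, on this reading, wait close to a full week before the next run picks it up.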