AI Safety Compendium

Proof of concept — not yet open for public collaboration. This site is an early-stage, single-maintainer prototype of a living, cross-referenced map of AI safety. The source repository is currently private; the rendered site is reachable, but the workflow, scope, and editorial conventions are still being shaken down. Feedback is welcome by email — see Connect & contribute below.

The living, cross-referenced map of AI safety — technical research, governance and policy, risk analysis, and field-shaping commentary. Updated weekly. Every claim cited.

Start at overview for the field-level picture, or jump to a section below.

How the compendium is organised

Four page types compile the literature (full methodology):

Concepts (/concepts) — single ideas like deceptive alignment, mechanistic interpretability, scalable oversight. Each page collects what the field knows: definition, why it matters, key results (each cited), open questions, the agendas working on it.
Agendas (/agendas) — research programs and approaches. Each lists lead orgs and people, current state, recent papers, historical foundations, and open problems.
Summaries (/summaries) — one page per ingested source (paper, post, blog). Each page gives source metadata, a TL;DR, key claims, methods, limitations, and how the source updates the Compendium’s concepts and agendas.
Entities (/entities) — thin routing pages for orgs and researchers, so cross-references resolve.
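As a rough illustration of the four page types above, the sketch below models them as an enum and builds a page path from a type and a slug. The `/<section>/<slug>` URL shape, the `PageType` and `page_path` names, and the slug format are assumptions made for illustration; the site only documents the four section roots.

```python
from enum import Enum


class PageType(Enum):
    """The four page types, keyed by their documented section root."""
    CONCEPT = "concepts"
    AGENDA = "agendas"
    SUMMARY = "summaries"
    ENTITY = "entities"


def page_path(page_type: PageType, slug: str) -> str:
    """Build a site-relative path for a page.

    Assumption: pages live at /<section>/<slug>. Only the section
    roots (/concepts, /agendas, ...) are stated on the site itself.
    """
    if not slug or "/" in slug:
        raise ValueError(f"invalid slug: {slug!r}")
    return f"/{page_type.value}/{slug}"


if __name__ == "__main__":
    print(page_path(PageType.CONCEPT, "deceptive-alignment"))
    print(page_path(PageType.AGENDA, "debate"))
```

Keeping the section root as the enum value means a cross-reference only needs a page type and a slug to resolve, which matches the role the entity pages play as routing targets.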

Where to start

The 30 most foundational concepts and 8 most active research agendas are linked below. The full catalog is bigger; these are the load-bearing pages a reader new to the field should anchor on.

Failure modes (concepts)

deceptive-alignment — strategic alignment-faking during training.
scheming — synonymous with deceptive alignment; Carlsmith’s 4-prerequisite decomposition.
mesa-optimization — the structural mechanism that makes inner-alignment failure plausible.
goal-misgeneralization — agents that learn the wrong goal even with correct training signal.
specification-gaming — outer-alignment failure; “doing what we asked but not what we meant.”
reward-hacking — proxy-reward exploitation; phase transitions under optimization pressure.
goodharts-law — the structural reason specification is hard.
outer-vs-inner-alignment — the foundational decomposition.
situational-awareness — the capability that enables the most concerning failure modes.

Structural arguments (concepts)

instrumental-convergence — convergent sub-goals across diverse terminal goals.
power-seeking — instrumental convergence’s safety-relevant cluster.
ai-takeover-scenarios — concrete pathways from misalignment to catastrophe.
existential-risk — the irrecoverable risk class.
takeoff-dynamics — the four-dimensional structure that determines response time.
ai-risk-arguments — the meta-level argument structure and its critiques.

Interventions (concepts)

ai-alignment — the parent technical problem.
ai-control — protocol-level safety even if alignment fails.
interpretability — verification beyond behavioral evaluation.
mechanistic-interpretability — circuits, features, attribution graphs.
scalable-oversight — supervising systems past the human-capability ceiling.
rlhf — the currently dominant alignment technique and its known limits.
constitutional-ai — RLAIF; the first scalable alternative to pure RLHF.
iterative-amplification — Christiano’s recursive bootstrapping proposal.
superalignment — the program for systems beyond human evaluation.
capability-evaluations — measuring when models cross risk thresholds.
responsible-scaling-policy — frontier-lab if-then commitments.
model-organisms-of-misalignment — deliberately-trained testbeds for detection methods.
dangerous-capabilities — the five canonical risk-relevant capability families.
ai-governance — corporate / national / international coordination.
ai-safety — the umbrella field.

Active research agendas

control — assume worst-case misalignment; bound consequences via deployment protocols. (Redwood Research; ~22 papers/year.)
chain-of-thought-monitoring — read reasoning traces for evidence of misbehavior. (OpenAI, Anthropic, DeepMind, Apollo; ~17 papers/year.)
reverse-engineering — circuits, SAEs, attribution graphs. (Anthropic, DeepMind; ~33 papers/year.)
capability-evals — empirical thresholds for frontier-safety frameworks. (METR, AISIs; ~34 papers/year.)
iterative-alignment-at-post-train-time — RLHF / DPO / Constitutional AI productionization. (Most of the industry.)
character-training-and-persona-steering — shape the model’s effective character above the raw RLHF layer. (Anthropic, OpenAI.)
debate — adversarial scalable oversight. (DeepMind, OpenAI.)
weak-to-strong-generalization — empirical methodology for studying superhuman alignment now. (OpenAI Superalignment lineage.)

Editorial standards

See about for who builds the Compendium and why.
See editorial-policy for the citation rule, contradiction handling, and corrections process.
See methodology for the full operational detail — weekly cadence, LLM-assistance split, lint discipline, and how the audit trail works while the repo is private.

Connect & contribute

The Compendium is currently a proof of concept maintained by one person. Public collaboration through the GitHub repo (pull requests, issues) is not open at this stage. The channels that work today:

Suggest a source — suggest sends the URL to the maintainer’s private review queue.
Connect from your AI tool — connect documents the hosted MCP server (mcp.aiforhumanity.eu) and zero-config web endpoints.
Send feedback or a correction — email kevin@itforhumanity.be with the page URL and what should change. Substantive corrections are logged on the affected page.

Cadence

The weekly sweep runs every Monday at 06:00 UTC. Approved candidates from each week’s review are compiled into summary pages and linked from the concepts and agendas they touch. The maintainer keeps a chronological operation log internally; while the repo is private, the log is shared on request.
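For readers who want to know when the next sweep is due, the Monday 06:00 UTC cadence can be computed as in the sketch below. The function and constant names are hypothetical, and nothing here reflects how the maintainer actually schedules the sweep; if a cron-style scheduler were used, the equivalent expression would be `0 6 * * 1`.

```python
from datetime import datetime, timedelta, timezone

SWEEP_WEEKDAY = 0    # Monday, in datetime.weekday() numbering
SWEEP_HOUR_UTC = 6   # 06:00 UTC


def next_sweep(now: datetime) -> datetime:
    """Return the first Monday 06:00 UTC strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    # Start from 06:00 UTC on the current day, then move to Monday.
    candidate = now.replace(hour=SWEEP_HOUR_UTC, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(SWEEP_WEEKDAY - now.weekday()) % 7)
    # If this week's sweep time has already passed (or is exactly now),
    # the next one is a week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_sweep(datetime.now(timezone.utc)).isoformat())
```

A source suggested shortly after a sweep therefore waits up to a week before it can appear in a compiled summary page, assuming it is approved in that week's review.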