AI Safety Compendium

Proof of concept — not yet open for public collaboration. This site is an early-stage, single-maintainer prototype of a living, cross-referenced map of AI safety. The source repository is currently private; the rendered site is reachable, but the workflow, scope, and editorial conventions are still being shaken down. Feedback is welcome by email — see Connect & contribute below.

The living, cross-referenced map of AI safety — technical research, governance and policy, risk analysis, and field-shaping commentary. Updated weekly. Every claim cited.

Start at overview for the field-level picture, or jump to a section below.

How the compendium is organised

Four page types compile the literature (see methodology for the full detail):

  • Concepts (/concepts) — single ideas like deceptive alignment, mechanistic interpretability, scalable oversight. Each page collects what the field knows: definition, why it matters, key results (each cited), open questions, the agendas working on it.
  • Agendas (/agendas) — research programs and approaches. Each lists lead orgs and people, current state, recent papers, historical foundations, and open problems.
  • Summaries (/summaries) — one page per ingested source (paper, post, blog). Source metadata, TL;DR, key claims, methods, limitations, and how the source updates the compendium’s concepts and agendas.
  • Entities (/entities) — thin routing pages for orgs and researchers, so cross-references resolve.

Where to start

The 30 most foundational concepts and 8 most active research agendas are linked below. The full catalog is bigger; these are the load-bearing pages a reader new to the field should anchor on.

Failure modes (concepts)

Structural arguments (concepts)

Interventions (concepts)

Active research agendas

  • control — assume worst-case misalignment; bound consequences via deployment protocols. (Redwood Research; ~22 papers/year.)
  • chain-of-thought-monitoring — read reasoning traces for evidence of misbehavior. (OpenAI, Anthropic, DeepMind, Apollo; ~17 papers/year.)
  • reverse-engineering — circuits, SAEs, attribution graphs. (Anthropic, DeepMind; ~33 papers/year.)
  • capability-evals — empirical thresholds for frontier-safety frameworks. (METR, AISIs; ~34 papers/year.)
  • iterative-alignment-at-post-train-time — RLHF / DPO / Constitutional AI productionization. (Most of the industry.)
  • character-training-and-persona-steering — shape the model’s effective character above the raw RLHF layer. (Anthropic, OpenAI.)
  • debate — adversarial scalable oversight. (DeepMind, OpenAI.)
  • weak-to-strong-generalization — empirical methodology for studying superhuman alignment now. (OpenAI Superalignment lineage.)

Editorial standards

  • See about for who builds the Compendium and why.
  • See editorial-policy for the citation rule, contradiction handling, and corrections process.
  • See methodology for the full operational detail — weekly cadence, LLM-assistance split, lint discipline, and how the audit trail works while the repo is private.

Connect & contribute

The Compendium is currently a proof of concept maintained by one person. Public collaboration via pull requests, issues, or the GitHub repo is not open at this stage. The channels that work today:

  • Suggest a source: suggest sends the URL to the maintainer’s private review queue.
  • Connect from your AI tool: connect documents the hosted MCP server (mcp.aiforhumanity.eu) and zero-config web endpoints.
  • Send feedback or a correction — email kevin@itforhumanity.be with the page URL and what should change. Substantive corrections are logged on the affected page.

Cadence

The weekly sweep runs every Monday at 06:00 UTC. Approved candidates from each week’s review are compiled into summary pages and linked from the concepts and agendas they touch. The maintainer keeps a chronological operation log internally; while the repo is private, it is shared on request.
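
For readers who want to time a suggestion or a check for new summaries, the schedule above can be sketched in a few lines of Python. This is an illustration only, not the Compendium's actual scheduler: the function name next_sweep and the two constants are hypothetical, and the only fact taken from this page is the Monday 06:00 UTC run time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: these constants mirror the stated cadence (Monday 06:00 UTC).
SWEEP_WEEKDAY = 0   # datetime.weekday() returns 0 for Monday
SWEEP_HOUR_UTC = 6


def next_sweep(now: datetime) -> datetime:
    """Return the first Monday 06:00 UTC strictly after `now`.

    `now` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    now = now.astimezone(timezone.utc)
    # Days until the coming Monday (0 if today is Monday).
    days_ahead = (SWEEP_WEEKDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=SWEEP_HOUR_UTC, minute=0, second=0, microsecond=0
    )
    # If today is Monday and 06:00 has already passed, roll to next week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_sweep(datetime.now(timezone.utc)).isoformat())
```

If the sweep were driven by a standard cron scheduler running in UTC (an assumption, since this page does not say how it is triggered), the equivalent expression would be 0 6 * * 1.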