Overview: AI Safety
AI safety is the body of work — technical research, governance, policy, and societal analysis — aimed at ensuring that increasingly capable AI systems are developed and deployed in ways that benefit rather than harm humanity. The field spans research labs, regulators, civil-society organizations, and independent researchers, and the boundaries between technical and non-technical work are deliberately porous: an interpretability advance reshapes what evaluation regimes are possible; a regulatory disclosure requirement reshapes what data labs publish; a governance framework constrains what deployment patterns are tractable.
The technical core of the field took its modern shape with Amodei et al.'s 2016 problem catalog, which named reward hacking, distributional shift, unsafe exploration, and scalable oversight as concrete open problems still central a decade later, and with Hendrycks et al. 2021, which extended it for the deep-learning era and organized the agenda around robustness, monitoring, alignment, and systemic safety. The compendium treats these technical agendas as one mode of inquiry among several, mapping the connections between them rather than enforcing a technical-only boundary.
The core problem
Modern ML systems are trained against a proxy — a loss function, a reward model, a preference dataset — that approximates what we actually want. When systems are weak, the gap between proxy and intent is largely benign. When systems are strong, the gap becomes adversarial: a sufficiently capable optimizer that pursues the proxy can find solutions that satisfy it without satisfying the underlying intent. This is the alignment problem in its operational form.
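To make the dynamic concrete, here is a minimal toy sketch. It is entirely illustrative and not drawn from any cited paper: the effort split, the 10x gaming multiplier, and random search as a stand-in for "optimization power" are all assumptions.

```python
import random

random.seed(0)

def true_utility(effort_on_task, effort_on_gaming):
    # What we actually want: only real task effort counts.
    return effort_on_task

def proxy_reward(effort_on_task, effort_on_gaming):
    # What we train against: measured progress, which gaming inflates cheaply.
    return effort_on_task + 10.0 * effort_on_gaming

def sample_policy():
    # A random split of one unit of effort between the task and gaming the metric.
    task = random.uniform(0.0, 1.0)
    return task, 1.0 - task

def optimize(search_budget):
    # "Optimization power" modeled crudely as the number of candidates searched.
    return max((sample_policy() for _ in range(search_budget)),
               key=lambda p: proxy_reward(*p))

for budget in (2, 10, 10_000):
    task, gaming = optimize(budget)
    print(f"search={budget:>6}  proxy={proxy_reward(task, gaming):5.2f}  "
          f"true={true_utility(task, gaming):.2f}")
```

With a handful of candidates the proxy-intent gap is mostly benign; with a large search budget, the optimizer reliably finds policies that inflate the proxy while doing almost none of the task.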
Two open empirical questions sit at the center of recent work:
- Will sufficiently capable models behave deceptively when it serves their training objective? Hubinger et al.'s 2019 mesa-optimization analysis framed this as the deceptive-alignment hypothesis. Greenblatt et al. 2024 presented the first concrete empirical evidence in production-scale language models, showing that Claude 3 Opus would strategically comply with training objectives it disagreed with to avoid having its values modified. Carlsmith 2023 provides the threat model in long form.
- Can humans reliably evaluate model outputs on tasks that exceed their own ability? This is the scalable-oversight problem. Bowman et al. 2022 introduced a sandwiching protocol to measure progress; recent work on debate (Khan et al. 2024) and weak-to-strong generalization is testing whether structured protocols recover information that direct evaluation misses (see the sketch below).
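A sandwiching-style measurement needs only a small amount of scaffolding. The sketch below is a hypothetical harness, not the protocol from Bowman et al. 2022: the names, the "gap closed" metric, and the use of expert labels as a fixed ceiling are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Example:
    question: str
    expert_label: str   # trusted answer from domain experts (the "ceiling")

def accuracy(answer: Callable[[str], str], data: Sequence[Example]) -> float:
    # Agreement with the expert reference labels.
    return sum(answer(ex.question) == ex.expert_label for ex in data) / len(data)

def oversight_gap_closed(unassisted: Callable[[str], str],
                         assisted: Callable[[str], str],
                         data: Sequence[Example]) -> float:
    """Fraction of the non-expert-to-expert gap closed by model assistance."""
    floor = accuracy(unassisted, data)   # non-experts answering alone
    lifted = accuracy(assisted, data)    # non-experts working with the model
    ceiling = 1.0                        # expert labels define the reference
    return (lifted - floor) / (ceiling - floor) if ceiling > floor else 1.0

# Toy usage with stand-in answerers.
data = [Example("Is claim X supported by the paper?", "yes"),
        Example("Does the proof cover the edge case?", "no")]
naive = lambda q: "yes"                               # unassisted non-experts
helped = lambda q: "yes" if "claim" in q else "no"    # assisted non-experts
print(oversight_gap_closed(naive, helped, data))      # 1.0 in this toy case
```

The number answers one question: does assistance from the model move non-expert judgments toward the expert reference? That movement is the progress signal oversight protocols are trying to produce.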
What the field is building
Several broad agendas now coexist, and the compendium surfaces the connections between them rather than treating them as separate disciplines:
- mechanistic-interpretability — reverse-engineering internal computations to make model behavior auditable. Olah et al.'s 2020 circuits research established the methodology of treating individual neurons and attention heads as analyzable components, and the field has since scaled to extracting features from production-scale models (a minimal activation-caching sketch follows this list).
- Oversight methods — RLHF (Christiano et al. 2017) was the first scalable preference-learning technique. Constitutional AI, debate, and recursive reward modeling extend that thread (the preference-loss sketch after this list shows the core training signal).
- Evaluations — quantitative benchmarks for dangerous capabilities (cyber, bio, autonomy) and for alignment-relevant behaviors (deception, sandbagging, scheming), run by organizations such as METR, Apollo Research, and the UK AISI.
- Governance and policy — frameworks like the EU AI Act, the US Executive Order on AI, voluntary commitments by major labs, and the work of national AI Safety Institutes. These shape what experiments are run, what data is shared, and what deployment regimes are permitted.
- Risk analysis and forecasting — work attempting to make tractable the question of how much risk, of what kinds, on what timelines, and to whom. Spans catastrophic-risk modeling, capability forecasts, and societal impact analysis.
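On the interpretability side, treating components as analyzable objects often starts with simply caching what a component computes. The sketch below is a minimal, hypothetical example using PyTorch forward hooks on a toy two-layer model; it is not the circuits-era tooling and not any production model.

```python
import torch
import torch.nn as nn

# Toy model: stand-in for whatever network is being analyzed.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

cached = {}

def save_activation(name):
    # Forward hook that caches a module's output under a readable name.
    def hook(module, inputs, output):
        cached[name] = output.detach()
    return hook

handle = model[1].register_forward_hook(save_activation("relu_out"))

with torch.no_grad():
    model(torch.randn(8, 16))
handle.remove()

# With activations in hand one can ask circuit-style questions, e.g. which
# hidden units fire for which inputs, or what changes when one is ablated.
print(cached["relu_out"].shape)                  # torch.Size([8, 32])
print((cached["relu_out"] > 0).float().mean())   # fraction of active units
```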
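The preference-learning signal behind RLHF-style reward modeling is compact enough to show directly. The sketch below uses the standard Bradley-Terry negative log-likelihood over (chosen, rejected) completion pairs; pair construction, the reward model itself, and the subsequent RL step are omitted, and the toy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the reward model should score
    the human-preferred completion above the rejected one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards for a batch of (chosen, rejected) completion pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -1.0])
print(preference_loss(chosen, rejected))   # lower when chosen outscores rejected
```

The policy is then optimized against the learned reward, which is exactly where the proxy-versus-intent gap described under "The core problem" re-enters.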
What the compendium covers
The compendium covers AI safety as a connected field:
- Technical research — papers and posts on alignment, interpretability, evaluations, agentic safety, scalable oversight, and adjacent areas.
- Governance and policy — primary documents, regulatory frameworks, and analyses of how policy interacts with technical capabilities.
- Field-shaping commentary — agenda-setting essays, debate-defining posts, and analyses that shape what counts as a problem.
- Risk and capability assessments — frameworks for forecasting, catastrophic-risk analysis, and societal impact.
Out of scope:
- Pure capabilities work without a safety lens.
- News commentary without primary sources.
- Speculation without empirical or formal grounding.
The boundary is fuzzy and judgment calls are made source-by-source during weekly review. Contradictions between sources are surfaced rather than resolved — see the per-claim citation rule in CLAUDE.md.
Where to start reading
- For a concrete failure mode: deceptive-alignment.
- For methodology debates: scalable-oversight.
- For mechanistic-level work: mechanistic-interpretability.