AI Safety

Definition

AI safety is the field dedicated to ensuring that AI systems — especially advanced and increasingly autonomous AI — do not cause catastrophic harm to humanity. It spans technical research (alignment, interpretability, evaluations), governance and policy (ai-governance, responsible scaling policies (RSPs)), and strategic analysis of how to navigate the transition to transformative AI (Wikipedia: AI safety; Atlas Ch.3 — Definitions).

The field’s standard technical taxonomy comes from Hendrycks et al. 2021, Unsolved Problems in ML Safety, which decomposes AI safety into four interlocking research areas:

  • Robustness — adversarial robustness, fault tolerance.
  • Monitoring — uncertainty estimation, OOD detection, transparency, backdoor detection (see the sketch after this list).
  • Alignment — proxy goals, reward-hacking, instrumental strategies, emergent goals, deception.
  • Systemic safety — cyber defense, institutional decision-making, race-to-the-bottom dynamics.
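
As a concrete instance of the monitoring bucket, here is a minimal sketch of the maximum-softmax-probability baseline for OOD detection (Hendrycks & Gimpel 2017). The threshold and toy logits are illustrative assumptions, not values from any cited source.

```python
import numpy as np

def max_softmax_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability (MSP): a standard OOD-detection baseline.

    In-distribution inputs tend to yield peaked (confident) softmax outputs;
    out-of-distribution inputs tend to yield flatter ones.
    """
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize exp
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def flag_ood(logits: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    # The threshold is hypothetical; in practice it is calibrated on
    # held-out in-distribution data (e.g. for a target false-positive rate).
    return max_softmax_score(logits) < threshold

# Toy usage: a confident prediction vs. a near-uniform one.
logits = np.array([[5.0, 0.1, 0.2],    # peaked -> likely in-distribution
                   [0.9, 1.0, 1.1]])   # flat   -> flagged as possible OOD
print(flag_ood(logits))                # [False  True]
```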

The field is often dated from Amodei et al. 2016, Concrete Problems in AI Safety — its first canonical technical research agenda — though it has earlier intellectual roots at MIRI and the Future of Humanity Institute.

Why it matters

AI safety has been ranked the world’s most pressing problem by 80,000 Hours since 2016, on the basis of an argument built from several claims:

  • AI could replace human cognitive labor in the most economically valuable tasks.
  • This could trigger an economic transition comparable in scale to past industrial revolutions.
  • The transition could be extremely rapid, driven by AI-powered feedback loops in research and development.
  • A rapid AI-driven transition raises catastrophic and existential risks.
  • Work on these problems is tractable but neglected — the window for action is narrow.

This argument now has backing from an international scientific body: the International AI Safety Report 2025, commissioned after the Bletchley summit and chaired by yoshua-bengio with 96 experts from 30 nations and the UN, treats catastrophic risk from advanced AI as a serious empirical concern requiring a policy response.

The field matters because the case for it does not rest on any single premise: goodharts-law guarantees specification problems; mesa-optimization makes inner-alignment failure a structural possibility; and scheming has now been demonstrated empirically in frontier LLMs (Greenblatt et al. 2024; see alignment-faking-in-large-language-models). Each line of evidence independently motivates the field; together they make the case for its existence robust.
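
To make the goodharts-law point concrete, the toy sketch below (an illustration of the general pattern, not code from any cited work) runs gradient ascent on a measurable proxy that tracks the true objective only locally. Pushed hard enough, proxy reward keeps rising while the true objective collapses.

```python
# Toy Goodhart demo: the true objective peaks at x = 1, but the measurable
# proxy keeps rewarding larger x, so an optimizer that sees only the proxy
# overshoots and destroys the value it was meant to create.

def true_objective(x: float) -> float:
    return -(x - 1.0) ** 2        # what we actually want: maximized at x = 1

def proxy_reward(x: float) -> float:
    return x                       # tracks the objective only while x < 1

x, lr = 0.0, 0.05
for _ in range(200):
    x += lr * 1.0                  # gradient ascent on the proxy (d(proxy)/dx = 1)

print(f"x after optimizing the proxy: {x:.1f}")     # 10.0, far past the optimum
print(f"proxy reward:   {proxy_reward(x):6.1f}")    #  10.0 and still climbing
print(f"true objective: {true_objective(x):6.1f}")  # -81.0, vs. 0.0 at x = 1
```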

Key results

  • The four canonical technical sub-fields (Hendrycks et al. 2021). Robustness, monitoring, alignment, and systemic safety have been adopted as the standard taxonomy across labs and AI Safety Institutes (AISIs).

  • The risk-class decomposition (80,000 Hours):

    • Power-seeking AI — instrumental-convergence-driven catastrophic risk; ranked #1 by 80,000 Hours.
    • Catastrophic misuse — humans deliberately using AI for mass-harm (CBRN, cyber).
    • Extreme power concentration — AI enabling unprecedented power concentration even when “aligned” to its controllers.
    • Economic disempowerment — humans losing economic bargaining power as AI replaces cognitive labor.

  • Bletchley + the AISI institutional layer (UK AI Safety Summit 2023; International AI Safety Report 2025). The November 2023 Bletchley summit triggered a structural shift from research-and-advocacy to government-backed institutions: AI Safety Institutes in the UK, US, Japan, Singapore, France, Canada, and others; the first global government-commissioned scientific review (chaired by Bengio); and a recurring summit cadence (Seoul 2024, Paris 2025).

  • The historical timeline of mainstream AI safety (Wikipedia: AI safety) shows the field’s transition from rationalist-community origins (MIRI, FHI, LessWrong) to mainstream technical research (Concrete Problems, DeepMind/Anthropic/OpenAI safety teams) to government infrastructure (AISIs, International AI Safety Report). Each phase marks a roughly order-of-magnitude expansion in the field’s resources.

  • Defense-in-depth is the field’s organizing pattern (Atlas Ch.3 — Definitions; Hendrycks et al. 2021). No single approach is expected to suffice. Layered defenses — alignment + control + interpretability + governance + societal preparedness — are the rule, not the exception (a toy version of the layering arithmetic appears after this list).

  • Capability and safety scale unevenly. Capabilities follow simple, robust scaling patterns; safety properties do not. This asymmetry — capabilities improve “for free” with scale, safety must be specifically engineered — is the structural reason AI safety needs to be a deliberate research investment rather than an emergent property of capability research (Hendrycks et al. 2021; Atlas Ch.3).
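
The scaling asymmetry can be made concrete with synthetic numbers shaped like published capability curves (the exponent, coefficients, and compute values below are illustrative assumptions). A capability trend is recoverable from a two-parameter power-law fit and extrapolates cleanly; no comparable closed-form fit predicts whether a model reward-hacks or deceives, so those properties must be evaluated and engineered directly.

```python
import numpy as np

# Synthetic "loss vs. compute" points roughly following a power law
# L(C) = a * C**(-b), the shape reported in scaling-law studies.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 3.0 * compute ** -0.05 * np.random.default_rng(0).normal(1.0, 0.01, 5)

# A power law is a straight line in log-log space, so fit there.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fit: L(C) ~= {a:.2f} * C^(-{b:.3f})")        # recovers b ~= 0.05
print(f"extrapolated loss at 1e23 FLOP: {a * 1e23 ** -b:.3f}")
```

The layering arithmetic behind defense-in-depth can likewise be sketched with toy numbers. The per-layer miss rates below are hypothetical, and the independence assumption is the load-bearing caveat: correlated blind spots shared across layers can make the true residual risk far larger.

```python
# Toy defense-in-depth arithmetic under an (optimistic) independence
# assumption: a failure must slip past every layer, so miss rates multiply.
layers = {"alignment": 0.10, "control": 0.20,
          "interpretability": 0.30, "governance": 0.50}  # hypothetical miss rates

residual = 1.0
for name, miss_rate in layers.items():
    residual *= miss_rate

print(f"residual risk if layers fail independently: {residual:.4f}")  # 0.0030
```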

Open questions

  • What’s the right balance between alignment and control? Alignment aims to make AI want what we want; control assumes alignment may fail and bounds consequences operationally. The field has not converged on the right resource allocation between the two (Atlas Ch.3).

  • Can technical safety carry the weight currently placed on it? RSPs and frontier safety frameworks bet that technical evaluations + voluntary commitments are enough. Independent analyses argue this is insufficient and binding regulation is needed. Resolution is a governance question, not a technical one (International AI Safety Report 2025).

  • Near-term harms vs. existential risk. The field debates the relative priority of present-day harms (bias, surveillance, job displacement) and long-term catastrophic risk. Whether these are complementary or competing is contested. See near-term-harms-vs-x-risk.

  • How does AI safety scale internationally? AISIs and the International AI Safety Report are the first attempts at globally-coordinated safety. Whether they hold under geopolitical competition (US/China dynamics, EU AI Act vs. US executive orders, race dynamics) is an open governance question.

  • Is the field’s current research portfolio adequate? The Hendrycks et al. 2021 taxonomy is widely accepted, but the resource allocation across the four areas is contested. Robustness and monitoring receive less safety-community attention than alignment despite arguably equal importance.

The field’s research agendas are tracked in detail in the SR2025 taxonomy.
