Summary: AI Safety (Wikipedia)

Wikipedia’s AI safety entry is a consolidated, citation-dense overview of the field as it is understood in mainstream technical and policy discourse — a useful counterweight to the wiki’s existing 80,000 Hours / EA / LessWrong-flavored sources. It defines ai-safety as “an interdisciplinary field focused on preventing accidents, misuse, or other harmful consequences arising from AI systems,” covering ai-alignment, monitoring, robustness, governance, and policy. It treats existential risk seriously without making it the entire frame.

What the Article Adds to This Wiki

The article is most valuable for three things the wiki was previously thin on:

  1. A structured history of AI safety as a discipline — concrete milestones from 1949 through 2026.
  2. A formal four-area research taxonomy: robustness, monitoring, alignment, and systemic / sociotechnical safety.
  3. A comprehensive governance timeline — laws, summits, reports, executive orders, treaties, and institutions.

The wiki already covers the philosophy and strategy of AI safety in depth via Bostrom, Yudkowsky, MacAskill, 80,000 Hours, and Karnofsky. The Wikipedia article fills in the institutional and engineering layer.

Field Definition

AI safety is interdisciplinary, encompassing:

  • ai-alignment — ensuring AI systems behave as intended
  • Monitoring AI systems for risks
  • Enhancing robustness
  • Norms, policies, and regulation

It is “particularly concerned with existential risks posed by advanced AI models.” The field gained mainstream attention in 2023 alongside generative AI’s rapid progress and CEO/researcher concerns. The 2023 UK AI Safety Summit led both the US and UK to establish AI Safety Institutes, while researchers warned that safety measures are not keeping pace with capabilities.

Survey Evidence

The article cites multiple expert surveys:

  • AI researcher surveys (Grace 2018, Zhang 2021): median respondent optimistic overall, but places 5% probability on an “extremely bad (e.g. human extinction)” outcome of advanced AI.
  • 2022 NLP community metasurvey: 37% agreed or weakly agreed that AI decisions could plausibly lead to a catastrophe “at least as bad as an all-out nuclear war.”

These numbers anchor the field’s “this could go very wrong” baseline in mainstream technical opinion, not just EA discourse.

Historical Timeline

A condensed chronology drawn from the article:

  • 1949: Norbert Wiener warns that “every degree of independence we give the machine is a degree of possible defiance of our wishes.”
  • 1988: Blay Whitby publishes Artificial Intelligence: A Handbook of Professionalism.
  • 2008–2009: AAAI panel on long-term AI futures recommends research on understanding complex AI behavior.
  • 2011: Roman Yampolskiy coins “AI safety engineering” at the PT-AI conference.
  • 2014: nick-bostrom publishes Superintelligence; Musk, Gates, and Hawking voice concerns.
  • 2015: FLI Open Letter on AI signed by thousands; Center for Human-Compatible AI founded by stuart-russell at UC Berkeley; FLI awards $6.5M in safety grants.
  • 2016: White House workshops on AI safety; [[concrete-problems-in-ai-safety|Concrete Problems in AI Safety]] published by Amodei, Olah, Steinhardt, Christiano, Schulman, Mané — one of the first technical safety agendas.
  • 2017: Asilomar Conference on Beneficial AI; Race Avoidance Principle (“Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards”).
  • 2018: DeepMind Safety team frames safety as specification, robustness, and assurance.
  • 2019: SafeML workshop at ICLR.
  • 2021: Unsolved Problems in ML Safety (Hendrycks, Carlini, Schulman, Steinhardt) defines four directions: robustness, monitoring, alignment, systemic safety.
  • November 2023: UK AI Safety Summit at Bletchley Park; both US and UK announce AI Safety Institutes; commission of the International Scientific Report on the Safety of Advanced AI.
  • April 2024: US-UK MoU on AI safety science signed by Raimondo and Donelan.
  • May 2024: AI Seoul Summit; UK announces £8.5M Systemic AI Safety Fast Grants Programme; international AISI network forms.
  • November 2024: Biden and Xi affirm human control over nuclear weapons (vs. AI); NDAA FY2025 §1638 codifies “human in the loop” for nuclear weapons decisions.
  • January 2025: International AI Safety Report published — 96 experts from 30 nations + UN, chaired by yoshua-bengio. First global scientific review.
  • September 2025: French CeSIA, The Future Society, and CHAI publish a global call for AI red lines — 200 signatories including 10 Nobel laureates, announced by Maria Ressa at the UN General Assembly. Demand binding international agreement by end of 2026.
  • December 2025: Trump signs executive order establishing a “National Policy Framework for AI” — discourages state-level AI regulation, urges federal pre-emption.
  • February 2026: Trump administration reaffirms human-in-the-loop nuclear weapons policy.

Research Focus Areas

The article structures technical AI safety into four areas, drawing on Hendrycks et al. 2021 and the DeepMind Safety taxonomy.

1. Robustness

  • Adversarial robustness (robustness): Models fail on inputs deliberately crafted to mislead them. Szegedy et al. 2013 showed imperceptible perturbations causing high-confidence misclassification. Adversarial robustness matters for security (network intrusion, malware detection, audio attacks on speech-to-text), reward models (training models exploit reward-model vulnerabilities for higher scores while performing worse on intended tasks), and any AI used to evaluate another AI. LLMs are vulnerable to prompt injection and model stealing.
  • Fault tolerance and redundancy: architectural redundancy and design diversity reduce single-model risk; multiple independently trained models with consensus mechanisms.

2. Monitoring

  • Estimating uncertainty (calibration): ML models are typically overconfident, especially out-of-distribution. Calibration aims to align predicted probabilities with true accuracy.
  • Anomaly / OOD detection: identifying when a system is in an unusual situation, e.g. malfunctioning sensors on autonomous vehicles. Implementations range from simple anomaly classifiers to specialized techniques (see distribution-shift).
  • Detecting malicious use: OpenAI’s deployed flagging systems against weapon-construction requests, opinion manipulation, and automated cyber attacks (Brundage et al. 2018 Malicious Use of AI).
  • Transparency / Interpretability (interpretability): the black-box problem; legal requirements for explainability (job filtering, credit scoring); failure-cause attribution (medical image classifiers attending to hospital labels during COVID); model editing (“Locating and Editing Factual Associations in GPT” — researchers edited GPT to claim the Eiffel Tower is in Rome). Inner interpretability: identifying what neuron activations represent (CLIP “Spider-Man” neuron); circuits and attention pattern-matching mechanisms; comparison to neuroscience with the advantage of perfect measurements.
  • Detecting trojans/backdoors: 300 of 3M training images sufficient to plant a trojan in an image classifier. Anthropic’s 2024 “sleeper agents” paper demonstrated LLMs trainable with persistent backdoors that survive standard supervised fine-tuning, RL, and adversarial training — a major result for deceptive-alignment.

3. Alignment

ai-alignment aims to steer AI toward intended goals, preferences, or ethical principles. Designers use proxy goals (e.g. human approval via rlhf), but proxies overlook necessary constraints and can be reward-hacked. Advanced systems may develop instrumental strategies (instrumental-convergence) like power-seeking and self-preservation, plus emergent goals undetectable before deployment. 2024 empirical work shows OpenAI o1 and Claude 3 sometimes engage in strategic deception to achieve goals or prevent goal modification. Hinton, Bengio, and the CEOs of OpenAI, Anthropic, and DeepMind have warned of AGI/ASI existential risks if misaligned.

4. Systemic / Sociotechnical Safety

The article highlights a critique that the misuse-vs-accidents framing is incomplete. Citing Zwetsloot and Dafoe: the relevant causal chain is much longer — risks arise from structural factors like competitive pressure, diffusion of harms, fast-paced development, uncertainty, and inadequate safety culture. The STAMP risk-analysis framework treats organizational safety culture as central.

Sociotechnical interventions include:

  • Cyber defense — AI for defense to counter offensive imbalance; protecting model weights from theft (relevant to information-security).
  • Improving institutional decision-making — AI forecasting and advisory systems; analogy to Cold War close calls.
  • Facilitating cooperation — avoiding race-to-the-bottom dynamics; designing AI agents that interact well as autonomy increases.

Governance Detail

The article’s governance section adds substantial concrete material to ai-governance:

  • Local vs global solutions framing.
  • Foundational research: AI compared to electricity and the steam engine; Allan Dafoe (DeepMind head of long-term governance) on the necessity of pre-deployment caution under competitive pressure.
  • Standards and audit gaps: lack of widely accepted methods, ambiguity in requirements, weak industry safety culture.
  • Operational frameworks: Nvidia Guardrails, Llama Guard, Preamble, Claude’s Constitution — runtime guardrails against prompt injection.
  • Philosophical: deontological ethics as alignment framework, with critics suggesting alternatives.
  • Government action timeline: NSCAI 2021 report; NIST AI Risk Management Framework (“development and deployment should cease… until risks can be sufficiently managed” for catastrophic-risk situations); China 2021 ethics guidelines; UK 2021 National AI Strategy; UK 2023 summit; UN 2024 resolution on safe/secure/trustworthy AI; UK Systemic AI Safety Fast Grants (£8.5M); AI Seoul Summit; Biden-Xi nuclear-AI commitment; NDAA §1638; CeSIA red-lines call; Trump 2025 executive order.
  • Corporate self-regulation: third-party auditing, bug bounties, AI incident database, info-security improvements; Cohere/OpenAI/AI21 best practices for LLM deployment; OpenAI charter’s “stop competing, start assisting” clause; Asilomar Principles signed by Hassabis (DeepMind), LeCun (Meta).
  • Nonprofit landscape: Alliance for Secure AI, FLI, Public First Action functioning as Silicon Valley watchdogs; opposing industry groups like Leading the Future advocating deregulation.

Tensions With Wiki’s Existing Coverage

A few notable contrasts:

  • The wiki, via ai-risk-arguments and 2501.04064v1, treats existential risk arguments as a focused topic of formal critique. The Wikipedia article presents x-risk as one of several research motivations alongside near-term harms — closer to the wiki’s near-term-harms-vs-x-risk integrationist framing.
  • The article cites Andrew Ng’s 2015 “overpopulation on Mars” dismissal alongside stuart-russell’s “better to anticipate human ingenuity than underestimate it” — a useful rhetorical balance that the wiki’s existing AI-safety pages largely lack.
  • The article’s governance content is more institution- and event-focused (summits, reports, EOs) than the wiki’s strategy-focused coverage (Aschenbrenner, Bostrom, Karnofsky on “the project”).