AI Safety Atlas Ch.3 — Definitions

“Before we can build solutions, we must agree on the problem.” This subchapter disambiguates four overlapping but distinct concepts: AI safety, AI alignment, AI ethics, and AI control. The wiki’s existing pages all benefit from these clean definitions.

AI Safety (Definition 3.1)

“Ensuring that AI systems do not inadvertently or deliberately cause harm or danger to humans or the environment, through research that identifies causes of unintended AI behavior and develops tools for safe and reliable operation.”

The broadest category. Encompasses robustness, monitoring, capability control, and more. While ai-alignment addresses goals and intentions, safety covers a wider set of concerns. See ai-safety.

AI Alignment (Definition 3.2)

“The problem of building machines that faithfully try to do what we want them to do (or what we ought to want them to do).” — Christiano, 2024

A subset of safety, focused on objective/value matching. The Atlas highlights an important reframing: systems could theoretically be aligned but unsafe (competently pursuing the goals we specified, which turn out to be harmful) or safe but unaligned (constrained so that misaligned objectives cannot cause harm).

Key open questions:

  • Broader vs. narrower definitions — does alignment include creating beneficial outcomes, or only intent-matching?
  • “Aligned to whom?” — operators, designers, specific groups, all humanity, ethical principles, hypothetical informed preferences? See alignment-to-whom.
  • Applying “trying” / “intent” to AI — non-trivial; optimization objectives don’t directly translate to human-like intentions (a toy sketch follows below).

See ai-alignment.
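
To make this “intent” gap concrete, here is a toy sketch. Everything in it (function names, numbers) is invented for illustration and is not from the Atlas: an optimizer can competently maximize a proxy objective and diverge from the designer’s intent without anything resembling human-like “trying.”

```python
# Toy illustration (hypothetical): optimizing an objective is not the
# same as "trying" to do what the designer intended.

def proxy_score(article_length: int) -> float:
    """Proxy objective the designer specified: longer articles score higher."""
    return float(article_length)

def intended_quality(article_length: int) -> float:
    """What the designer actually wanted: quality peaks at moderate length."""
    return -(article_length - 800) ** 2  # best around 800 words

candidates = range(100, 5001, 100)

# The optimizer competently maximizes the objective it was given...
best_for_proxy = max(candidates, key=proxy_score)
# ...which diverges from what the designer meant.
best_for_intent = max(candidates, key=intended_quality)

print(f"Proxy-optimal length:    {best_for_proxy} words")   # 5000
print(f"Intended-optimal length: {best_for_intent} words")  # 800
```

The system never “intended” anything; it simply optimized the objective it was handed, which is why translating talk of “trying” into optimization terms is non-trivial.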

AI Ethics (Definition 3.3)

“The study and application of moral principles to AI development and deployment, addressing questions of fairness, transparency, accountability, privacy, autonomy, and other human values that AI systems should respect or promote.” — Huang et al., 2023

Complements technical safety with normative guidance. While alignment ensures systems pursue intended objectives, ethics determines which objectives are worth pursuing. Includes digital rights, potential AI rights, fairness, and bias mitigation.

This page-level definition is new to the wiki; it connects to the near-term-harms-vs-x-risk tension, where the AI ethics and AI safety communities often debate priorities.

AI Control (Definition 3.4)

“The technical and procedural measures designed to prevent AI systems from causing unacceptable outcomes, even if these systems actively attempt to subvert safety measures. Control focuses on maintaining human oversight regardless of whether the AI’s objectives align with human intentions.” — Greenblatt et al., 2024

A strategy distinct from alignment: alignment prevents preference divergence by design, while control creates security layers that function even when alignment fails. Methods include monitoring actions, restricting capabilities, human auditing, and termination mechanisms.
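
As a rough sketch of how these layers might compose (the class and method names below are hypothetical, not drawn from Greenblatt et al. or the Atlas), a control wrapper can gate every proposed action through monitoring, capability restriction, and human escalation, with a termination path that does not depend on the wrapped model’s objectives:

```python
# Hypothetical control-layer sketch: monitoring, capability restriction,
# human auditing, and a termination mechanism that work regardless of
# whether the underlying model's objectives are aligned.
from dataclasses import dataclass, field

@dataclass
class ControlledAgent:
    allowed_actions: set[str]                            # capability restriction
    audit_log: list[str] = field(default_factory=list)   # human auditing trail
    terminated: bool = False

    def terminate(self, reason: str) -> None:
        """Termination mechanism: a hard stop independent of the model."""
        self.audit_log.append(f"TERMINATED: {reason}")
        self.terminated = True

    def execute(self, proposed_action: str, risk_score: float) -> str:
        if self.terminated:
            return "refused: agent terminated"
        # Monitoring: every proposal is logged before anything runs.
        self.audit_log.append(f"proposed: {proposed_action} (risk={risk_score})")
        if proposed_action not in self.allowed_actions:
            self.terminate(f"disallowed action {proposed_action!r}")
            return "refused: outside allowed capabilities"
        if risk_score > 0.8:
            # Escalation: high-risk actions pause for human review.
            self.audit_log.append("escalated to human reviewer")
            return "paused: awaiting human approval"
        return f"executed: {proposed_action}"

agent = ControlledAgent(allowed_actions={"summarize", "translate"})
print(agent.execute("summarize", risk_score=0.2))   # executed
print(agent.execute("send_email", risk_score=0.1))  # refused, then terminated
```

The point of the sketch is that each layer functions even if the wrapped policy is misaligned, which is exactly what distinguishes control from alignment.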

The Atlas frames alignment and control as “complementary approaches,” which substantially clarifies the wiki’s existing ai-control page, centered on buck-shlegeris’s framing.

The Four-Way Decomposition

| Aspect | AI Safety | AI Alignment | AI Ethics | AI Control |
| --- | --- | --- | --- | --- |
| Scope | Broadest | Goal-matching | Normative | Operational containment |
| Key question | Does it harm? | Does it pursue our goals? | Should it pursue this? | Can we stop it if it doesn’t? |
| Failure mode | Unsafe deployment | Misaligned objectives | Wrong objectives | Loss of oversight |

Connection to Wiki

This subchapter provides clean definitional anchors for the wiki’s existing pages:

  • ai-safety — Definition 3.1 is now the canonical definition
  • ai-alignment — Definition 3.2 (Christiano 2024); the broader/narrower distinction
  • ai-control — Definition 3.4 (Greenblatt et al. 2024) adds rigor to the existing page
  • New AI Ethics treatment fills a gap — the concept was referenced loosely in near-term-harms-vs-x-risk but never previously defined