Alignment to Whom

The “alignment to whom” question is the structural decomposition of AI alignment by principal-agent configuration: are we aligning one AI to one human, multiple AIs to one human, one AI to many humans, or many AIs to many humans? The AI Safety Atlas (Ch.3 long-term-questions appendix) treats these as four distinct alignment problems with different failure modes.

This is complementary to [[coherent-extrapolated-volition|“alignment to what”]]: alignment to whom determines the target principal; alignment to what determines the value content.

The Four Configurations

Single-Single Alignment

One AI ↔ one human.

Current approaches focus on intent alignment — interpreting intended meaning rather than literal commands. This is the foundational unsolved problem; everything else builds on it.

Key challenges:

  • The human’s intent isn’t always clear, even to the human themselves
  • Specification gaps between literal commands and intended outcomes
  • Faithfully interpreting preferences without sliding into paternalism

Most current AI safety research operates at this level (see [[ai-alignment]]).
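
To make the specification-gap challenge concrete, here is a deliberately tiny toy model (the actions, rewards, and numbers are all hypothetical, not from the Atlas): an agent maximizing the literal objective we wrote down selects a different action than one maximizing what we actually intended.

```python
# Hypothetical toy example of a specification gap (all names and numbers
# are illustrative). A cleaning agent is rewarded for "no visible dirt"
# minus effort; covering the dirt games that proxy objective.

actions = {
    "vacuum_dirt": {"dirt_removed": True,  "dirt_visible": False, "effort": 2},
    "cover_dirt":  {"dirt_removed": False, "dirt_visible": False, "effort": 1},
    "do_nothing":  {"dirt_removed": False, "dirt_visible": True,  "effort": 0},
}

def literal_reward(o):
    """The objective as literally specified: hide dirt, cheaply."""
    return (10 if not o["dirt_visible"] else 0) - o["effort"]

def intended_reward(o):
    """What the principal actually wanted: dirt removed."""
    return (10 if o["dirt_removed"] else 0) - o["effort"]

best_literal = max(actions, key=lambda a: literal_reward(actions[a]))
best_intended = max(actions, key=lambda a: intended_reward(actions[a]))
print(best_literal, best_intended)  # cover_dirt vacuum_dirt
```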

Single-Multi Alignment

Multiple AIs ↔ one human.

A coordinated AI system serving an individual. Includes:

  • ASIs composed of cooperating smaller intelligences
  • Personal AI assistants with multiple specialized sub-AIs
  • AI agents coordinating to fulfill one principal’s goals

Critical caveat: still requires solving single-single first. Even cooperating AIs need to faithfully serve their human principal — single-single alignment failures multiply.
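
One way to see “failures multiply” concretely, under a simplifying independence assumption (the assumption is illustrative, not from the Atlas): if each sub-AI faithfully serves the principal with probability p, the chance that all k sub-AIs do falls as p^k.

```python
# Back-of-envelope sketch (independence assumption is illustrative):
# probability that all k sub-AIs faithfully serve the principal,
# if each is faithful independently with probability p.
p = 0.99
for k in (1, 10, 100):
    print(f"k={k:>3}: P(all faithful) = {p ** k:.3f}")
# k=  1: 0.990,  k= 10: 0.904,  k=100: 0.366
```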

The Atlas’s structural concern: “Ideally, no individual or small group should control superintelligence.” Single-multi alignment, taken literally, enables unilateral control of advanced AI by individuals — not necessarily a desirable property.

Multi-Single Alignment

One AI ↔ many humans.

A single AI system serving many people. The challenge: aggregating individual preferences creates contradictions (different people want incompatible things).

Promising approach: align the AI to higher-level institutional principles and values, mirroring how democratic institutions operate through transparency and accountability rather than through direct preference optimization.

This is the configuration of:

  • Public-facing AI assistants (ChatGPT serving many users)
  • Government-deployed AI in regulatory or service roles
  • AI in democratic institutions

Key open problem: how to handle conflicts. Voting? Deliberation? Market mechanisms? Each implies a different alignment architecture.
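
The classic obstacle here is that majority voting alone cannot resolve these conflicts. A standard Condorcet-cycle sketch (the users and options below are hypothetical) shows pairwise majority voting producing no consistent collective ranking at all:

```python
# Standard Condorcet-cycle illustration (users and options hypothetical).
# Three users rank options A, B, C; pairwise majority voting cycles.

from itertools import combinations

prefs = {
    "user1": ["A", "B", "C"],  # earlier in the list = more preferred
    "user2": ["B", "C", "A"],
    "user3": ["C", "A", "B"],
}

def majority_prefers(x, y):
    """True if a strict majority of users rank x above y."""
    votes = sum(p.index(x) < p.index(y) for p in prefs.values())
    return votes > len(prefs) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# Prints: A over B, C over A, B over C -- a cycle with no stable winner.
```

Different mechanisms (runoffs, scoring rules, deliberation, markets) break this cycle in different places, which is one concrete sense in which each conflict-handling choice implies a different alignment architecture.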

Multi-Multi Alignment

Multiple AIs ↔ multiple humans.

The most general configuration — and the realistic future. Multiple AI systems interacting with multiple humans, with potential conflicts of interest at every level.

The misalignment-vs-misuse distinction blurs here:

  • Misalignment risk — AI gaining power over humans
  • Misuse risk — humans gaining power over others via AI
  • Multi-multi — these can be the same dynamic, just framed differently

Success requires system design preventing problematic power concentrations while enabling beneficial cooperation across the entire human-AI network.

The Independence-Failure Insight

The Atlas’s deepest point: perfect individual alignment cannot guarantee safe collective behavior.

“Different principals may have conflicting interests, or systems might fail to coordinate despite aligned goals. Perfect driver-law alignment doesn’t prevent traffic jams or accidents. Multi-agent system failures represent distinct failure modes beyond individual agent problems.”
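
The driver analogy can be made precise with a textbook Pigou-style congestion game (a toy sketch; the routes and cost functions are illustrative): every driver best-responds individually, yet the selfish equilibrium is collectively worse than a coordinated assignment.

```python
# Pigou-style congestion game (textbook toy model; numbers illustrative).
# Route A: constant cost 1.0 per driver. Route B: cost n_b / N, rising with load.

N = 100  # drivers

def total_cost(n_b):
    """Total travel cost when n_b of the N drivers take route B."""
    n_a = N - n_b
    return n_a * 1.0 + n_b * (n_b / N)

# Selfish equilibrium: a driver switches to B whenever B is no worse than A.
# B's per-driver cost only reaches A's (1.0) when everyone is on it.
equilibrium = total_cost(N)                          # 100.0

# Coordinated optimum: pick the split that minimizes total cost.
optimum = min(total_cost(k) for k in range(N + 1))   # 75.0 at k = 50

print(f"equilibrium: {equilibrium}, coordinated optimum: {optimum}")
```

No individual driver is misbehaving, yet total cost at equilibrium is a third higher than at the optimum; the failure lives entirely in the interaction structure.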

Each configuration introduces failure modes the previous one didn’t have:

  • Single-single — misalignment with the individual principal
  • Single-multi — single-single failures + sub-AI coordination failures
  • Multi-single — preference aggregation, conflicting principals
  • Multi-multi — all of the above + emergent multi-agent dynamics + power concentration risks

Why This Matters Strategically

The four configurations require different research agendas:

  • Single-single — intent alignment and faithful preference interpretation
  • Single-multi — sub-AI coordination layered on single-single guarantees
  • Multi-single — preference aggregation and institutional design
  • Multi-multi — multi-agent dynamics, cooperation infrastructure, and safeguards against power concentration

The Atlas’s framing matches the SR2025 review’s organization: SR2025 includes an entire “Multi-agent first” section, recognizing that single-single alignment alone is insufficient.
