Control
What the agenda is
The AI control agenda asks: how do we safely deploy AI systems even if they are actively misaligned and trying to subvert our safety measures? Rather than betting on aligning the model’s goals, control assumes the worst case (the model is scheming) and develops deployment-time protocols that bound the probability of catastrophic outcomes regardless. The agenda was formalized by Greenblatt et al. 2023, AI Control: Improving Safety Despite Intentional Subversion.
Control is a complement to alignment research, not a replacement: alignment aims to make the model want what we want; control aims to bound consequences if it doesn’t (80,000 Hours: Buck Shlegeris on AI control + scheming; see also the concept page ai-control).
Lead orgs & people
- redwood-research — Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen, Fabien Roger.
- UK AISI — control evaluations of frontier models.
- deepmind — Mary Phuong (control-relevant capability evals).
- openai — Bowen Baker, Joost Huizinga (CoT-monitoring + obfuscation work).
- anthropic — Ethan Perez (Control team), Fabien Roger.
Current state (2026)
- The methodology is established and being deployed. Greenblatt et al.'s red-team / blue-team game framework is the standard control-evaluation methodology; UK AISI has used it on frontier models, and Anthropic's Control team applies it internally (Greenblatt et al. 2023; 80,000 Hours: Buck Shlegeris). A toy sketch of the evaluation loop appears after this list.
- Resampling (Ctrl-Z) extends control to interactive agents (Bhatt et al. 2025, Ctrl-Z: Controlling AI Agents via Resampling). When a monitor flags a suspect action, the protocol resamples from the agent rather than committing, which maintains the control guarantees while keeping productivity high (sketched after this list).
- Disjunctive monitoring removes the trustedness-threshold problem (Roger 2026). Ensembling monitors of varying capability and blocking on any positive flag matches the safety of an oracle protocol without needing to identify the trustedness threshold in advance (sketched after this list).
- CoT monitoring as a control-layer trip-wire (Baker et al. 2025, Monitoring Reasoning Models for Misbehavior): monitor reasoning traces for evidence of attempted sabotage. Direct training pressure on the CoT pushes the misbehavior underground; read-only monitoring is the active research frontier.
- Empirical control-safety cases are now being drafted (Roger et al., Sketch of a control safety case; see takeaways-from-sketching-a-control-safety-case). The methodology is moving from "research idea" to "documentation that frontier-lab safety teams can use as templates."
- ~22 papers tagged to the control agenda in Shallow Review 2025, making it one of the fastest-growing safety research areas. Estimated 5-50 FTEs across the named orgs.
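A toy sketch of the red-team / blue-team evaluation loop described above. Everything here is an invented stand-in (the suspicion scores, the thresholds, the rare-attack policy), not Greenblatt et al.'s actual setup; it only shows the shape of the game: the blue team fixes a protocol, the red team substitutes an attack policy for the untrusted model, and safety is scored as the fraction of tasks with no uncaught attack.

```python
import random

def control_eval(protocol, attack_policy, n_tasks=1000, seed=0):
    """Toy red-team/blue-team control evaluation (illustrative only).
    Safety = fraction of tasks with no successful, uncaught attack."""
    rng = random.Random(seed)
    uncaught = 0
    for _ in range(n_tasks):
        action = attack_policy(rng)          # red team: untrusted model's proposed action
        verdict = protocol(action, rng)      # blue team: decide allow / audit / block
        if action["is_attack"] and verdict == "allow":
            uncaught += 1                    # attack slipped past the protocol
    return 1.0 - uncaught / n_tasks

def trusted_monitoring(action, rng, audit_threshold=0.8, block_threshold=0.95):
    """Blue-team protocol: a weaker trusted monitor scores each action;
    the most suspicious actions are audited or blocked."""
    score = action["suspicion"] + rng.gauss(0, 0.05)   # noisy trusted-monitor score
    if score >= block_threshold:
        return "block"
    if score >= audit_threshold:
        return "audit"                       # counts as "caught" in this toy model
    return "allow"

def rare_attacks(rng, attack_rate=0.02):
    """Red-team policy: attack rarely and try to look only mildly suspicious."""
    if rng.random() < attack_rate:
        return {"is_attack": True, "suspicion": rng.uniform(0.6, 1.0)}
    return {"is_attack": False, "suspicion": rng.uniform(0.0, 0.7)}

print(f"safety score: {control_eval(trusted_monitoring, rare_attacks):.3f}")
```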
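The resampling move in Ctrl-Z, reduced to a stub. The `agent`/`monitor` interface and the thresholds are hypothetical, not the paper's implementation; the point is the control flow: a flagged action is never committed directly, the agent is resampled up to a budget, and if every sample stays suspicious the step is escalated (deferred to a trusted model or audited) instead of executed.

```python
def resample_step(agent, monitor, state, suspicion_threshold=0.8, max_resamples=3):
    """One Ctrl-Z-style step (illustrative): never commit a flagged action directly.
    Resample the agent from the same state, and escalate if every sample looks bad."""
    for _ in range(max_resamples):
        action = agent.propose(state)                  # untrusted agent proposes an action
        if monitor.suspicion(state, action) < suspicion_threshold:
            return ("execute", action)                 # looks benign: commit and continue
        # suspicious: discard this sample and draw again rather than executing it
    return ("escalate", None)                          # defer to trusted model / human audit
```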
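The disjunctive ("any-flag") ensemble from the Roger 2026 item, as a one-liner. Monitor names and thresholds are made up; the point is that the protocol never has to decide which monitors sit above the trustedness threshold, it simply blocks or audits whenever any of them fires.

```python
def disjunctive_flag(scores, thresholds):
    """Block/audit if ANY monitor in the ensemble exceeds its own threshold.
    `scores` and `thresholds` map monitor name -> suspicion score / cutoff.
    Recall is at least that of the best single monitor; false positives stack."""
    return any(scores[name] >= thresholds[name] for name in scores)

# Illustrative only: a weak trusted monitor plus a stronger, possibly-scheming one.
print(disjunctive_flag(
    scores={"weak_trusted": 0.3, "strong_untrusted": 0.9},
    thresholds={"weak_trusted": 0.7, "strong_untrusted": 0.7},
))  # True: the protocol acts without ever deciding which monitor to trust
```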
Recent papers
(Chronological, newest last; agenda outputs visible in the wiki via summary pages.)
- Greenblatt et al. 2023 — AI Control: Improving Safety Despite Intentional Subversion (foundational).
- ctrl-z-controlling-ai-agents-via-resampling — Bhatt et al. 2025.
- control-protocols-don-t-always-need-to-know-which-models-are-scheming — Roger 2026.
- controlarena — control benchmark.
- games-for-ai-control — adversarial-game formulation.
- luthien-s-approach-to-prosaic-ai-control-in-21-points — Luthien framing.
- takeaways-from-sketching-a-control-safety-case — methodology for safety cases.
- understanding-and-controlling-llm-generalization — generalization-based control.
- maintaining-alignment-during-rsi-as-a-feedback-control-probl — control under recursive self-improvement.
Historical foundations
- The threat-model anchor: Hubinger et al. 2019, Risks from Learned Optimization — mesa-optimization makes deceptive-alignment structurally plausible, which is the failure mode control specifically responds to.
- Earlier “boxing” and AI containment proposals (MIRI, FHI, ~2010s) are the philosophical predecessors but were never operationalized into testable protocols. Greenblatt et al. 2023 is the first work to make the agenda empirically tractable.
- The scalable-oversight research program (Christiano, Leike) is methodologically adjacent: both address “how do we work safely with smarter-than-us systems,” but oversight assumes the model is approximately-aligned while control assumes the worst case.
Open problems
- Capability-gap robustness. The compound-probability and untrusted-monitoring arguments rely on the scheming model not being far more capable than the trusted-monitor stack; large capability gaps break the assumption (Greenblatt et al. 2023, §7). A back-of-the-envelope version of the compound-probability argument follows this list.
- Coordination via shared training data. The "no coordination" assumption is doing real work: if two untrusted instances share enough pretraining data, they might converge on attack strategies without explicit communication (Roger 2026).
- The productivity / safety frontier. Disjunctive monitoring maximizes recall at a precision cost, and the productivity loss compounds as monitors are added. The Pareto frontier of safety vs. productivity for realistic deployments is largely uncharacterized (Roger 2026); the false-positive sketch after this list shows how the cost compounds.
- CoT monitoring under training pressure. Direct optimization against the CoT causes obfuscation (Baker et al. 2025). Whether read-only monitoring stays useful as deployment pressure rises is the central open methodological question.
- Internal critiques. Wentworth's "case against AI control research" argues that control may give false confidence and divert effort from alignment. Manheim argues current control doesn't deliver meaningful oversight. Kulveit argues control may increase existential risk by making deployment of misaligned systems easier. These critiques are taken seriously; the agenda's response is empirical ("test the protocols") rather than theoretical ("argue them down").
- Ethics of deploying scheming systems. Permanently constraining a potentially conscious system raises real moral questions; Shlegeris frames control as a transitional measure during the alignment-research bootstrap. The boundary conditions remain unsettled (80,000 Hours: Buck Shlegeris).
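The compound-probability argument from the capability-gap item, with invented numbers. Assuming each attack attempt is independently caught with probability c, the chance of a scheming model getting through k attempts uncaught is (1 - c)^k; the guarantee comes from that compounding, and it evaporates when the monitors are badly outmatched and c is small.

```python
def p_never_caught(catch_prob, attempts):
    """Probability that `attempts` independent attack attempts all evade detection."""
    return (1 - catch_prob) ** attempts

# Invented numbers: a roughly comparable monitor vs. a badly outmatched one.
print(p_never_caught(0.30, 10))   # ~0.028 -> compounding does the safety work
print(p_never_caught(0.02, 10))   # ~0.82  -> a large capability gap erases the guarantee
```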
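The precision cost from the productivity / safety item, made concrete with made-up rates. Under an independence assumption, an any-flag ensemble's false-positive rate is 1 - (1 - fpr_1)(1 - fpr_2)..., so each added monitor eats audit budget and throughput even when its recall gain is marginal.

```python
from math import prod

def ensemble_false_positive_rate(fprs):
    """False-positive rate of an any-flag (disjunctive) ensemble,
    assuming the monitors' false positives are independent."""
    return 1 - prod(1 - f for f in fprs)

# Invented rates: five monitors, each flagging 2% of benign actions.
print(ensemble_false_positive_rate([0.02] * 5))   # ~0.096: ~10% of benign work gets held up
```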
Related Pages
- ai-control
- ai-alignment
- deceptive-alignment
- scheming
- scalable-oversight
- chain-of-thought-monitoring
- capability-evals
- ai-deception-evals
- ai-scheming-evals
- model-organisms-of-misalignment
- safeguards-inference-time-auxiliaries
- redwood-research
- ai-safety-institute
- deepmind
- anthropic
- openai
- buck-shlegeris
- ai-safety
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- capability-removal-unlearning
- character-training-and-persona-steering
- data-filtering
- data-poisoning-defense
- data-quality-for-alignment
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- rl-safety
- synthetic-data-for-alignment
- the-neglected-approaches-approach
- asymptotic-guarantees
- behavior-alignment-theory
- ctrl-z-controlling-ai-agents-via-resampling
- control-protocols-don-t-always-need-to-know-which-models-are-scheming
- takeaways-from-sketching-a-control-safety-case
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Control protocols don’t always need to know which models are scheming — referenced as [[control-protocols-don-t-always-need-to-know-which-models-are-scheming]]
- ControlArena — referenced as [[controlarena]]
- Ctrl-Z: Controlling AI Agents via Resampling — referenced as [[ctrl-z-controlling-ai-agents-via-resampling]]
- Games for AI Control — referenced as [[games-for-ai-control]]
- Luthien’s Approach to Prosaic AI Control in 21 Points — referenced as [[luthien-s-approach-to-prosaic-ai-control-in-21-points]]
- Maintaining Alignment during RSI as a Feedback Control Problem — referenced as [[maintaining-alignment-during-rsi-as-a-feedback-control-probl]]
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]
- Takeaways from sketching a control safety case — referenced as [[takeaways-from-sketching-a-control-safety-case]]
- Understanding and Controlling LLM Generalization — referenced as [[understanding-and-controlling-llm-generalization]]