Guaranteed-Safe AI — SR2025 Agenda Snapshot
One-sentence summary: Have an AI system generate outputs (e.g., code, control systems, or RL policies) whose compliance with a formal safety specification and world model can be quantitatively guaranteed.
Theory of Change
Various, including:
i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;
ii) secure containers: create a ‘gatekeeper’ system that can act as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through.
(Notably, this does not require solving ELK, though it does require solving the ontology problem.)
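The gatekeeper pattern in (ii) can be sketched in miniature: an intermediary that simulates a proposed action against a world model and only admits it if a formal safety specification holds throughout. This is an illustrative toy only; `WorldModel`, `safe`, and `gatekeeper` are hypothetical names, and a real Guaranteed-Safe AI gatekeeper would use formal proof or verified reachability analysis rather than finite-horizon rollout checking.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WorldModel:
    """Toy world model: maps (state, action) to the next state."""
    step: Callable[[float, float], float]

def safe(state: float) -> bool:
    """Toy formal safety specification: state must stay within bounds."""
    return abs(state) <= 10.0

def gatekeeper(world: WorldModel, state: float, action: float,
               horizon: int = 5) -> bool:
    """Admit an action only if the world model predicts the safety
    specification holds at every step of the lookahead horizon."""
    s = state
    for _ in range(horizon):
        s = world.step(s, action)
        if not safe(s):
            return False  # reject: spec violated in the modelled rollout
    return True

# Usage: a drifting system where repeated large actions breach the bound.
world = WorldModel(step=lambda s, a: s + a)
print(gatekeeper(world, 0.0, 1.0))  # stays within |s| <= 10 -> True
print(gatekeeper(world, 0.0, 3.0))  # reaches 12 by step 4 -> False
```

The design choice to reject on *any* modelled violation reflects the worst-case target of this agenda: the untrusted system's action is admitted only when the (trusted) world model certifies the specification across the whole horizon.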
Broad Approach
cognitive / engineering
Target Case
worst-case
Orthodox Problems Addressed
Value is fragile and hard to specify, Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Humans cannot be first-class parties to a superintelligent value handshake, A boxed AGI might exfiltrate itself via steganography or spearphishing
Key People
ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Beneficial AI Foundation, Steve Omohundro, David “davidad” Dalrymple, Joar Skalse, Stuart Russell, Alessandro Abate
Funding
Manifund, ARIA, Coefficient Giving, Survival and Flourishing Fund, Mila / CIFAR
Estimated FTEs: 10-100
Critiques
Zvi, Gleave, Dickson, Greenblatt
See Also
Towards Guaranteed Safe AI, Standalone World-Models, scientist-ai, Safeguarded AI, asymptotic-guarantees, Open Agency Architecture, SLES, program synthesis, Scalable formal oversight
Outputs in 2025
5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: guaranteed-safe-ai (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Guaranteed-Safe AI) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- asymptotic-guarantees
- scientist-ai
- brainlike-agi-safety
- assistance-games-assistive-agents
- heuristic-explanations
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]