Guaranteed-Safe AI — SR2025 Agenda Snapshot

One-sentence summary: Have an AI system generate outputs (e.g. code, control systems, or RL policies) that carry quantitative guarantees of compliance with a formal safety specification, evaluated against a formal world model.

Theory of Change

Various, including:

i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;

ii) secure containers: create a ‘gatekeeper’ system that acts as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through (sketched below).

(Notable for not requiring that we solve ELK; it does, however, require that we solve the ontology problem.)
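A minimal sketch of the gatekeeper pattern from (ii), using the z3-solver Python API: an untrusted policy proposes a control input, and the gatekeeper admits it only if the solver proves that no state permitted by the world model can violate the safety specification. The toy world model (temperature dynamics), the spec bound of 120, and the gain 0.5 are illustrative assumptions, not details from any cited system.

```python
# Hypothetical gatekeeper: admit an action only with a proof of safety.
# World model, spec, and all constants are illustrative assumptions.
from z3 import Real, Solver, And, unsat

def gatekeeper_allows(u_proposed: float) -> bool:
    """Return True iff the proposed control input is provably safe
    for every state the world model permits."""
    temp = Real("temp")
    s = Solver()
    # World model (assumed): current temperature lies in [0, 100],
    # and one control step moves it by 0.5 * u.
    s.add(And(temp >= 0, temp <= 100))
    next_temp = temp + 0.5 * u_proposed
    # Ask the solver for a counterexample to the spec next_temp <= 120.
    s.add(next_temp > 120)
    # unsat: no permitted state violates the spec, so the action
    # passes with a worst-case guarantee rather than on trust.
    return s.check() == unsat

# Untrusted proposals pass only if accompanied by a proof.
print(gatekeeper_allows(30.0))  # True: worst case 100 + 15 <= 120
print(gatekeeper_allows(50.0))  # False: 100 + 25 > 120
```

In a real gatekeeper the world model and spec would be far richer (probabilistic, temporal) and the guarantee quantitative rather than binary, but the division of labor (untrusted proposer, trusted verifier) is the load-bearing part.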

Broad Approach

cognitive / engineering

Target Case

worst-case

Orthodox Problems Addressed

Value is fragile and hard to specify; Goals misgeneralize out of distribution; Superintelligence can fool human supervisors; Humans cannot be first-class parties to a superintelligent value handshake; A boxed AGI might exfiltrate itself by steganography or spearphishing

Key People

ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Beneficial AI Foundation, Steve Omohundro, David “davidad” Dalrymple, Joar Skalse, Stuart Russell, Alessandro Abate

Funding

Manifund, ARIA, Coefficient Giving, Survival and Flourishing Fund, Mila / CIFAR

Estimated FTEs: 10-100

Critiques

Zvi, Gleave, Dickson, Greenblatt

See Also

Towards Guaranteed Safe AI, Standalone World-Models, scientist-ai, Safeguarded AI, asymptotic-guarantees, Open Agency Architecture, SLES, program synthesis, Scalable formal oversight

Outputs in 2025

5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: guaranteed-safe-ai (generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.