Guaranteed-Safe AI — SR2025 Agenda Snapshot

One-sentence summary: Have an AI system generate outputs (e.g. code, control systems, or RL policies) that carry quantitative guarantees of compliance with a formal safety specification, evaluated against a formal world model.

Theory of Change

Various, including:

i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;

ii) secure containers: create a ‘gatekeeper’ system that acts as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through (sketched below).

(Notable for not requiring that we solve ELK; it does, however, require that we solve the ontology problem.)
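A minimal sketch of the gatekeeper pattern from (ii), using the z3-solver Python API: an untrusted policy proposes a control input, and the gatekeeper admits it only if the solver proves that no state permitted by the world model can violate the safety specification. The toy world model (temperature dynamics), the spec bound of 120, and the gain 0.5 are illustrative assumptions, not details from any cited system.

```python
# Hypothetical gatekeeper: admit an action only with a proof of safety.
# World model, spec, and all constants are illustrative assumptions.
from z3 import Real, Solver, And, unsat

def gatekeeper_allows(u_proposed: float) -> bool:
    """Return True iff the proposed control input is provably safe
    for every state the world model permits."""
    temp = Real("temp")
    s = Solver()
    # World model (assumed): current temperature lies in [0, 100],
    # and one control step moves it by 0.5 * u.
    s.add(And(temp >= 0, temp <= 100))
    next_temp = temp + 0.5 * u_proposed
    # Ask the solver for a counterexample to the spec next_temp <= 120.
    s.add(next_temp > 120)
    # unsat: no permitted state violates the spec, so the action
    # passes with a worst-case guarantee rather than on trust.
    return s.check() == unsat

# Untrusted proposals pass only if accompanied by a proof.
print(gatekeeper_allows(30.0))  # True: worst case 100 + 15 <= 120
print(gatekeeper_allows(50.0))  # False: 100 + 25 > 120
```

In a real gatekeeper the world model and spec would be far richer (probabilistic, temporal) and the guarantee quantitative rather than binary, but the division of labor (untrusted proposer, trusted verifier) is the load-bearing part.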

Broad Approach

cognitive / engineering

Target Case

worst-case

Orthodox Problems Addressed

Value is fragile and hard to specify; Goals misgeneralize out of distribution; Superintelligence can fool human supervisors; Humans cannot be first-class parties to a superintelligent value handshake; A boxed AGI might exfiltrate itself by steganography or spearphishing

Key People

ARIA, Lawzero, Atlas Computing, FLF, Max Tegmark, Beneficial AI Foundation, Steve Omohundro, David “davidad” Dalrymple, Joar Skalse, Stuart Russell, Alessandro Abate

Funding

Manifund, ARIA, Coefficient Giving, Survival and Flourishing Fund, Mila / CIFAR

Estimated FTEs: 10-100

Critiques

Zvi, Gleave, Dickson, Greenblatt

See Also

Towards Guaranteed Safe AI, Standalone World-Models, scientist-ai, Safeguarded AI, asymptotic-guarantees, Open Agency Architecture, SLES, program synthesis, Scalable formal oversight

Outputs in 2025

5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: guaranteed-safe-ai (generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.