Causal Abstractions — SR2025 Agenda Snapshot

One-sentence summary: Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
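
A minimal sketch of that core check, under illustrative assumptions: a toy task (output = (a AND b) OR c), a hand-built two-unit ReLU network, and an alignment hypothesis that the high-level intermediate variable V = a AND b lives in one hidden unit. None of these details come from the agenda's own experiments; the sketch just shows the interchange-intervention test that causal abstraction work uses, where an aligned low-level activation is patched from a source run into a base run and the output is compared against the high-level model under the corresponding intervention.

```python
# Hedged sketch: interchange-intervention check that a tiny network
# "implements" a high-level causal model. The task, network weights and
# alignment hypothesis below are illustrative assumptions only.
import itertools
import numpy as np

# --- High-level causal model: V = a AND b, output O = V OR c ----------
def high_level(a, b, c, v_override=None):
    v = (a and b) if v_override is None else v_override
    return int(v or c)

# --- Low-level model: hand-built one-hidden-layer ReLU network --------
W1 = np.array([[1.0, 1.0, 0.0],   # h[0] = ReLU(a + b - 1) == a AND b
               [0.0, 0.0, 1.0]])  # h[1] = ReLU(c) == c
b1 = np.array([-1.0, 0.0])
w2 = np.array([1.0, 1.0])         # logit = h[0] + h[1]

def hidden(x):
    return np.maximum(W1 @ x + b1, 0.0)

def readout(h):
    return int(w2 @ h > 0.5)

# Alignment hypothesis: high-level variable V maps to hidden unit 0.
ALIGNED_UNIT = 0

def interchange(base, source):
    """Run the network on `base`, but patch the aligned unit with the
    activation it takes on `source` (a low-level intervention)."""
    h = hidden(base)
    h[ALIGNED_UNIT] = hidden(source)[ALIGNED_UNIT]
    return readout(h)

# Interchange intervention accuracy: over all (base, source) pairs, does
# the patched network agree with the high-level model in which V is set
# to the source input's value of V?
inputs = [np.array(x, dtype=float) for x in itertools.product([0, 1], repeat=3)]
agree = total = 0
for base in inputs:
    for source in inputs:
        v_source = int(source[0]) and int(source[1])
        expected = high_level(int(base[0]), int(base[1]), int(base[2]),
                              v_override=v_source)
        agree += (interchange(base, source) == expected)
        total += 1
print(f"interchange intervention accuracy: {agree / total:.2f}")  # 1.00 here
```

In practice the agenda's work searches for the alignment rather than hand-picking a unit, for example by learning a rotated subspace of real model activations (as in distributed alignment search), and reports the resulting interchange intervention accuracy as the evidence that the high-level model is faithfully implemented.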

Theory of Change

By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model uses safe reasoning processes and predict its behavior on unseen inputs, rather than relying on behavioral testing alone.

Broad Approach

Cognitive science

Target Case

Worst-case

Orthodox Problems Addressed

Goals misgeneralize out of distribution

Key People

Atticus Geiger, Christopher Potts, Thomas Icard, Theodora-Mara Pîslar, Sara Magliacane, Jiuding Sun, Jing Huang

Funding

Various academic groups, Google DeepMind, Goodfire

Estimated FTEs: 10-30

Critiques

The Misguided Quest for Mechanistic AI Interpretability, Interpretability Will Not Reliably Find Deceptive AI

See Also

Concept-based interpretability, reverse-engineering

Outputs in 2025

3 items in the review. See the wiki/summaries/ entries with frontmatter agenda: causal-abstractions; these were generated alongside this file from the same export.

Source

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.