Causal Abstractions — SR2025 Agenda Snapshot

One-sentence summary: Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
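
A minimal sketch of that core check, under illustrative assumptions: a toy task (output = (a AND b) OR c), a hand-built two-unit ReLU network, and an alignment hypothesis that the high-level intermediate variable V = a AND b lives in one hidden unit. None of these details come from the agenda's own experiments; the sketch just shows the interchange-intervention test that causal abstraction work uses, where an aligned low-level activation is patched from a source run into a base run and the output is compared against the high-level model under the corresponding intervention.

```python
# Hedged sketch: interchange-intervention check that a tiny network
# "implements" a high-level causal model. The task, network weights and
# alignment hypothesis below are illustrative assumptions only.
import itertools
import numpy as np

# --- High-level causal model: V = a AND b, output O = V OR c ----------
def high_level(a, b, c, v_override=None):
    v = (a and b) if v_override is None else v_override
    return int(v or c)

# --- Low-level model: hand-built one-hidden-layer ReLU network --------
W1 = np.array([[1.0, 1.0, 0.0],   # h[0] = ReLU(a + b - 1) == a AND b
               [0.0, 0.0, 1.0]])  # h[1] = ReLU(c) == c
b1 = np.array([-1.0, 0.0])
w2 = np.array([1.0, 1.0])         # logit = h[0] + h[1]

def hidden(x):
    return np.maximum(W1 @ x + b1, 0.0)

def readout(h):
    return int(w2 @ h > 0.5)

# Alignment hypothesis: high-level variable V maps to hidden unit 0.
ALIGNED_UNIT = 0

def interchange(base, source):
    """Run the network on `base`, but patch the aligned unit with the
    activation it takes on `source` (a low-level intervention)."""
    h = hidden(base)
    h[ALIGNED_UNIT] = hidden(source)[ALIGNED_UNIT]
    return readout(h)

# Interchange intervention accuracy: over all (base, source) pairs, does
# the patched network agree with the high-level model in which V is set
# to the source input's value of V?
inputs = [np.array(x, dtype=float) for x in itertools.product([0, 1], repeat=3)]
agree = total = 0
for base in inputs:
    for source in inputs:
        v_source = int(source[0]) and int(source[1])
        expected = high_level(int(base[0]), int(base[1]), int(base[2]),
                              v_override=v_source)
        agree += (interchange(base, source) == expected)
        total += 1
print(f"interchange intervention accuracy: {agree / total:.2f}")  # 1.00 here
```

In practice the agenda's work searches for the alignment rather than hand-picking a unit, for example by learning a rotated subspace of real model activations (as in distributed alignment search), and reports the resulting interchange intervention accuracy as the evidence that the high-level model is faithfully implemented.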

Theory of Change

By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model uses safe reasoning processes and predict its behavior on unseen inputs, rather than relying on behavioral testing alone.

Broad Approach

Cognitive science

Target Case

Worst-case

Orthodox Problems Addressed

Goals misgeneralize out of distribution

Key People

Atticus Geiger, Christopher Potts, Thomas Icard, Theodora-Mara Pîslar, Sara Magliacane, Jiuding Sun, Jing Huang

Funding

Various academic groups, Google DeepMind, Goodfire

Estimated FTEs: 10-30

Critiques

The Misguided Quest for Mechanistic AI Interpretability, Interpretability Will Not Reliably Find Deceptive AI

See Also

Concept-based interpretability, reverse-engineering

Outputs in 2025

3 items in the review. See the wiki/summaries/ entries with frontmatter agenda: causal-abstractions; these were generated alongside this file from the same export.

Source

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.