Causal Abstractions — SR2025 Agenda Snapshot
One-sentence summary: Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
Theory of Change
By establishing a causal mapping between a black-box neural network and a human-interpretable algorithm, we can check whether the model is using safe reasoning processes and predict its behaviour on unseen inputs, rather than relying on behavioural testing alone.
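The workhorse experiment behind this claim is the interchange intervention: swap the low-level activations hypothesised to encode a high-level variable from a "source" run into a "base" run, and check whether the network's output matches what the high-level model predicts under the corresponding counterfactual. Below is a minimal sketch, assuming a toy network and a hand-picked alignment site (the first two hidden units); the model, the site, and the AND task are illustrative placeholders, not taken from the cited work, and a real experiment would search for the alignment (e.g. with distributed alignment search) on a trained model.

```python
# Minimal sketch of an interchange-intervention test for causal abstraction.
# TinyNet, the AND task, and the choice of h[:, :2] as the alignment site are
# illustrative assumptions, not the agenda's actual models or code.
import itertools
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy network hypothesised to compute AND(a, b) via an intermediate variable V."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)
        self.layer2 = nn.Linear(4, 1)

    def hidden(self, x):
        return torch.relu(self.layer1(x))

    def forward(self, x, patch=None):
        h = self.hidden(x)
        if patch is not None:
            # Interchange intervention: overwrite the hypothesised low-level
            # site for V with activations recorded from the source run.
            h = h.clone()
            h[:, :2] = patch
        return torch.sigmoid(self.layer2(h))

def high_level_output(a, b):
    # High-level causal model: output = V, where V = AND(a, b).
    return a and b

def interchange_accuracy(net, inputs):
    """Fraction of (base, source) pairs where patching V's hypothesised site
    makes the low-level output match the high-level counterfactual."""
    hits = total = 0
    with torch.no_grad():
        for base, source in itertools.product(inputs, repeat=2):
            patch = net.hidden(source)[:, :2]            # V's value from the source run
            pred = (net(base, patch=patch) > 0.5).item()
            # Setting V to its source value makes the high-level output equal
            # AND(a_source, b_source), regardless of the base input.
            expected = high_level_output(*(v > 0.5 for v in source.squeeze().tolist()))
            hits += (pred == expected)
            total += 1
    return hits / total

inputs = [torch.tensor([[float(a), float(b)]]) for a, b in itertools.product([0, 1], repeat=2)]
net = TinyNet()  # assumed trained to compute AND; left untrained here for brevity
print(f"interchange intervention accuracy: {interchange_accuracy(net, inputs):.2f}")
```

An accuracy of 1.0 over all interchange pairs is the evidence that the alignment hypothesis holds; in practice the alignment site is found by optimisation over rotations of the representation space rather than fixed by hand.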
Broad Approach
cognitive science
Target Case
worst-case
Orthodox Problems Addressed
Goals misgeneralize out of distribution
Key People
Atticus Geiger, Christopher Potts, Thomas Icard, Theodora-Mara Pîslar, Sara Magliacane, Jiuding Sun, Jing Huang
Funding
Various academic groups, Google DeepMind, Goodfire
Estimated FTEs: 10-30
Critiques
- The Misguided Quest for Mechanistic AI Interpretability
- Interpretability Will Not Reliably Find Deceptive AI
See Also
Concept-based interpretability, reverse-engineering
Outputs in 2025
3 items in the review. See the wiki/summaries/ entries with frontmatter agenda: causal-abstractions (generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Causal Abstractions) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- reverse-engineering
- activation-engineering
- data-attribution
- extracting-latent-knowledge
- human-inductive-biases
- learning-dynamics-and-developmental-interpretability
- lie-and-deception-detectors
- model-diffing
- monitoring-concepts
- other-interpretability
- pragmatic-interpretability
- representation-structure-and-geometry
- sparse-coding
- natural-abstractions
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]