Other interpretability — SR2025 Agenda Snapshot
One-sentence summary: Interpretability work that does not fit neatly into the other categories.
Theory of Change
Explore alternative conceptual frameworks (e.g., agentic, propositional) and physics-inspired methods (e.g., renormalization), or take a “pragmatic” approach.
Broad Approach
engineering / cognitive
Target Case
mixed
Orthodox Problems Addressed
Superintelligence can fool human supervisors, Goals misgeneralize out of distribution
Key People
Lee Sharkey, Dario Amodei, David Chalmers, Been Kim, Neel Nanda, David D. Baek, Lauren Greenspan, Dmitry Vaintrob, Sam Marks, Jacob Pfau
Funding
Estimated FTEs: 30-60
Critiques
The Misguided Quest for Mechanistic AI Interpretability, Interpretability Will Not Reliably Find Deceptive AI.
See Also
reverse-engineering, Concept based interpretability
Outputs in 2025
19 items in the review. See the wiki/summaries/ entries with frontmatter agenda: other-interpretability (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Other interpretability) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- reverse-engineering
- activation-engineering
- causal-abstractions
- data-attribution
- extracting-latent-knowledge
- human-inductive-biases
- learning-dynamics-and-developmental-interpretability
- lie-and-deception-detectors
- model-diffing
- monitoring-concepts
- pragmatic-interpretability
- representation-structure-and-geometry
- sparse-coding
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]