Heuristic explanations — SR2025 Agenda Snapshot
One-sentence summary: Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations”, and use them to predict when novel inputs will lead to extreme behavior (the “Low Probability Estimation” and “Mechanistic Anomaly Detection” problems).
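As a concrete illustration of the anomaly-detection half of this summary, the sketch below flags inputs whose hidden activations look unlike those seen on a trusted distribution, scored by Mahalanobis distance. This is a standard baseline for the general idea, not ARC’s formalism; the toy network, the trusted data, and the regularizer are all assumptions made for illustration.

```python
"""Hypothetical mechanistic-anomaly-detection baseline (not ARC's method):
score inputs by how far their hidden activations fall from the statistics
of activations on a trusted, in-distribution dataset."""
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer "network" (illustrative assumption, not a real model).
d, h = 16, 32
W1 = rng.standard_normal((d, h)) / np.sqrt(d)

def activations(x: np.ndarray) -> np.ndarray:
    """Hidden-layer activations for a batch of inputs."""
    return np.maximum(x @ W1, 0.0)

# Fit Gaussian statistics to activations on trusted inputs.
trusted = activations(rng.standard_normal((5000, d)))
mu = trusted.mean(axis=0)
cov = np.cov(trusted, rowvar=False) + 1e-6 * np.eye(h)  # regularized
cov_inv = np.linalg.inv(cov)

def anomaly_score(x: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each input's activations from trusted stats."""
    z = activations(x) - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", z, cov_inv, z))

in_dist = rng.standard_normal((5, d))
shifted = rng.standard_normal((5, d)) * 3.0  # stand-in for novel inputs
print("in-dist scores :", np.round(anomaly_score(in_dist), 2))
print("shifted scores :", np.round(anomaly_score(shifted), 2))
# Higher scores on shifted inputs suggest the network is computing its
# output "for a different reason" than on trusted data, which is the kind
# of event mechanistic anomaly detection aims to flag.
```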
Theory of Change
The current goalpost is methods whose reasoned predictions about properties of a neural network’s output distribution (for a given input distribution) are certifiably at least as accurate as estimates obtained via sampling. If successful for safety-relevant properties, this should enable automated alignment methods that are both human-legible and worst-case certified, as well as more efficient than sampling-based methods in most cases.
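To make the sampling baseline concrete: plain Monte Carlo needs on the order of 1/p samples to observe even one instance of an event of probability p, which is what a reasoned, heuristic-explanation-based estimator would have to beat. The sketch below is a minimal illustration under assumed toy choices (a random ReLU MLP, Gaussian inputs, an arbitrary threshold); it is not ARC’s method.

```python
"""Why sampling struggles with rare events: a plain Monte Carlo baseline
for estimating p = Pr_x[f(x) > t] on a toy network f with threshold t."""
import numpy as np

rng = np.random.default_rng(0)

# Toy "network" (illustrative assumption): fixed random two-layer ReLU MLP.
d, h = 16, 32
W1 = rng.standard_normal((d, h)) / np.sqrt(d)
W2 = rng.standard_normal(h) / np.sqrt(h)

def f(x: np.ndarray) -> np.ndarray:
    """Scalar output of the toy network for a batch of inputs."""
    return np.maximum(x @ W1, 0.0) @ W2

THRESHOLD = 3.0  # "extreme behavior" = output above this (arbitrary choice)

def sampling_estimate(n: int, batch: int = 100_000) -> float:
    """Monte Carlo estimate of Pr_x[f(x) > THRESHOLD], x ~ N(0, I_d)."""
    hits = 0
    for start in range(0, n, batch):
        m = min(batch, n - start)
        x = rng.standard_normal((m, d))
        hits += int(np.sum(f(x) > THRESHOLD))
    return hits / n

for n in (10**3, 10**5, 10**6):
    print(f"n={n:>9,}  p_hat={sampling_estimate(n):.2e}")
# If the true probability is, say, ~1e-5, small-n estimates are usually
# exactly 0: sampling gives no signal until n ~ 1/p. That inefficiency is
# what reasoned estimates from heuristic explanations are meant to avoid.
```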
Broad Approach
cognitive / maths / philosophy
Target Case
worst-case
Orthodox Problems Addressed
Goals misgeneralize out of distribution, Superintelligence can hack software supervisors
Key People
Jacob Hilton, Mark Xu, Eric Neyman, Victor Lecomte, George Robinson
Funding
Estimated FTEs: 1-10
Critiques
See Also
ARC Theory, ELK, mechanistic anomaly detection, Acorn, guaranteed-safe-ai
Outputs in 2025
5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: heuristic-explanations (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Heuristic explanations) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- guaranteed-safe-ai
- agent-foundations
- asymptotic-guarantees
- behavior-alignment-theory
- high-actuation-spaces
- natural-abstractions
- other-corrigibility
- the-learning-theoretic-agenda
- tiling-agents
- extracting-latent-knowledge
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]