Heuristic explanations — SR2025 Agenda Snapshot

One-sentence summary: Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations”, and use them to predict when novel inputs will lead to extreme behavior (i.e. “Low Probability Estimation” and “Mechanistic Anomaly Detection”).

Theory of Change

The current goalpost is methods whose reasoned predictions about properties of a neural network’s output distribution (for a given input distribution) are certifiably at least as accurate as estimates obtained via sampling. If successful for safety-relevant properties, this should allow for automated alignment methods that are both human-legible and worst-case certified, as well as more efficient than sampling-based methods in most cases.
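To illustrate why beating sampling matters, here is a minimal sketch (not ARC's method; all names are hypothetical) of the baseline being compared against: a Monte Carlo estimator of the probability of a rare “extreme behavior” event, which needs on the order of 1/p samples before it even observes one such event.

```python
import random

def sampling_estimate(p: float, n_samples: int, seed: int = 0) -> float:
    """Estimate P(event) by drawing n_samples i.i.d. Bernoulli(p) trials.

    Stand-in for running a model on sampled inputs and counting how
    often the extreme behavior occurs.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(n_samples))
    return hits / n_samples

# For a common event, sampling works well:
common = sampling_estimate(p=0.5, n_samples=100_000)

# For a rare event (p = 1e-6) with only 10,000 samples, the estimate
# is almost certainly 0.0 -- the regime where Low Probability
# Estimation aims to do better by reasoning about the mechanism
# instead of sampling.
rare = sampling_estimate(p=1e-6, n_samples=10_000)
```

The gap between the two cases is the point: a sampling-based estimate of a 10⁻⁶-probability behavior is uninformative at feasible sample sizes, so a method that reasons from a heuristic explanation can in principle be both cheaper and strictly more accurate there.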

Broad Approach

cognitive / maths / philosophy

Target Case

worst-case

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Superintelligence can hack software supervisors

Key People

Jacob Hilton, Mark Xu, Eric Neyman, Victor Lecomte, George Robinson

Funding

Estimated FTEs: 1-10

Critiques

Matolcsi

See Also

ARC Theory, ELK, mechanistic anomaly detection, Acorn, guaranteed-safe-ai

Outputs in 2025

5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: heuristic-explanations (these were generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.