Heuristic explanations — SR2025 Agenda Snapshot

One-sentence summary: Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations”, and use them to predict when novel inputs will lead to extreme behavior (i.e. “Low Probability Estimation” and “Mechanistic Anomaly Detection”).

Theory of Change

The current goalpost is methods whose reasoned predictions about properties of a neural network’s output distribution (for a given input distribution) are certifiably at least as accurate as estimates obtained via sampling. If successful for safety-relevant properties, this should allow for automated alignment methods that are both human-legible and worst-case certified, as well as more efficient than sampling-based methods in most cases.
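To illustrate why beating sampling matters, here is a minimal sketch (not ARC's method; all names are hypothetical) of the baseline being compared against: a Monte Carlo estimator of the probability of a rare “extreme behavior” event, which needs on the order of 1/p samples before it even observes one such event.

```python
import random

def sampling_estimate(p: float, n_samples: int, seed: int = 0) -> float:
    """Estimate P(event) by drawing n_samples i.i.d. Bernoulli(p) trials.

    Stand-in for running a model on sampled inputs and counting how
    often the extreme behavior occurs.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(n_samples))
    return hits / n_samples

# For a common event, sampling works well:
common = sampling_estimate(p=0.5, n_samples=100_000)

# For a rare event (p = 1e-6) with only 10,000 samples, the estimate
# is almost certainly 0.0 -- the regime where Low Probability
# Estimation aims to do better by reasoning about the mechanism
# instead of sampling.
rare = sampling_estimate(p=1e-6, n_samples=10_000)
```

The gap between the two cases is the point: a sampling-based estimate of a 10⁻⁶-probability behavior is uninformative at feasible sample sizes, so a method that reasons from a heuristic explanation can in principle be both cheaper and strictly more accurate there.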

Broad Approach

cognitive / maths / philosophy

Target Case

worst-case

Orthodox Problems Addressed

Goals misgeneralize out of distribution, Superintelligence can hack software supervisors

Key People

Jacob Hilton, Mark Xu, Eric Neyman, Victor Lecomte, George Robinson

Funding

Estimated FTEs: 1-10

Critiques

Matolcsi

See Also

ARC Theory, ELK, mechanistic anomaly detection, Acorn, guaranteed-safe-ai

Outputs in 2025

5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: heuristic-explanations (these were generated alongside this file from the same export).

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.