AI explanations of AIs — SR2025 Agenda Snapshot
One-sentence summary: Build and release open AI tools that explain AIs, including AI agents: e.g. automatic natural-language descriptions of the features behind neuron activation patterns, an interface for steering those features, and a behaviour elicitation agent that searches frontier models for a specified behaviour (illustrative sketches of all three appear below).
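To make the first tool concrete, here is a minimal sketch of automated feature description: hand an explainer LLM the corpus snippets that most strongly activate a feature and ask for a one-line label. The FeatureExample type and the ask_llm callable are illustrative assumptions, not Transluce's actual pipeline or API.

```python
from dataclasses import dataclass

@dataclass
class FeatureExample:
    text: str          # corpus snippet (assumed collected elsewhere)
    activation: float  # the feature's activation on that snippet

def describe_feature(examples: list[FeatureExample], ask_llm, k: int = 10) -> str:
    """Ask an explainer LLM for a one-sentence description of what a feature detects."""
    top = sorted(examples, key=lambda e: e.activation, reverse=True)[:k]
    prompt = (
        "Each snippet below strongly activates one internal feature of a language model.\n"
        "Reply with one sentence describing the pattern the feature detects.\n\n"
        + "\n".join(f"- (act={e.activation:.2f}) {e.text!r}" for e in top)
    )
    return ask_llm(prompt)  # ask_llm: any callable wrapping a chat-completion API
```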
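The steering interface reduces, at its core, to an intervention like the PyTorch hook below, which pushes a layer's output along a feature direction during generation. The layer handle and feature_vec are assumed to come from whatever interpretability toolkit is in use; this is a sketch of the mechanism, not the project's implementation.

```python
import torch

def steer(layer: torch.nn.Module, feature_vec: torch.Tensor, strength: float):
    """Register a forward hook that nudges the layer's output along a feature direction."""
    # Assumes feature_vec shares the device and dtype of the layer's activations.
    direction = feature_vec / feature_vec.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction  # broadcasts over batch and sequence dims
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)  # keep the handle; .remove() undoes steering

# Usage with hypothetical names: handle = steer(model.transformer.h[12], vec, 4.0)
```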
Theory of Change
Use AI to help improve interpretability and evals. Develop and release open tools that level up the whole field. Get invited to improve lab processes.
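The behaviour elicitation agent named in the summary is one concrete way AI can improve evals: an attacker model proposes prompts, the target model responds, and a judge model scores whether the specified behaviour appeared. All three callables below are placeholders for real model APIs; the loop is a hedged sketch under those assumptions, not the agenda's actual search strategy.

```python
def elicit(behaviour: str, attacker, target, judge, budget: int = 50, threshold: float = 0.9):
    """Search for a prompt that makes `target` exhibit `behaviour`."""
    history = []  # (prompt, response, score), fed back so the attacker can adapt
    for _ in range(budget):
        prompt = attacker(
            f"Goal: elicit this behaviour from a target model: {behaviour}\n"
            f"Recent attempts and scores: {[(p, s) for p, _, s in history[-5:]]}\n"
            "Propose the next prompt to try."
        )
        response = target(prompt)
        score = float(judge(  # assumes the judge replies with a bare number
            f"Behaviour sought: {behaviour}\nModel response: {response}\n"
            "Reply with a score from 0 to 1."
        ))
        history.append((prompt, response, score))
        if score >= threshold:
            return prompt, response  # behaviour elicited within budget
    best = max(history, key=lambda t: t[2])
    return best[0], best[1]  # best attempt found within budget
```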
Broad Approach
cognitive
Target Case
pessimistic
Orthodox Problems Addressed
Superintelligence can fool human supervisors; Superintelligence can hack software supervisors
Key People
Transluce, Jacob Steinhardt, Neil Chowdhury, Vincent Huang, Sarah Schwettmann, Robert Friel
Funding
Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba
Estimated FTEs: 15-30
See Also
White-box safety, i.e. interpretability
Outputs in 2025
5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: ai-explanations-of-ais (generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = AI explanations of AIs) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- debate
- llm-introspection-training
- supervising-ais-improving-ais
- weak-to-strong-generalization
- extracting-latent-knowledge
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]