Principles for Picking Practical Interpretability Projects
Sam Marks — 2025-07-15 — Anthropic — LessWrong
Summary
Proposes a framework for selecting interpretability research directions, arguing that interpretability is most valuable where structural obstructions limit behavioral oversight. Lists concrete problem areas, including deception detection, spurious cue removal, and sandbagging detection.
Source
- Link: https://lesswrong.com/posts/DqaoPNqhQhwBFqWue/principles-for-picking-practical-interpretability-projects
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- other-interpretability — White-box safety (i.e. Interpretability)