Principles for Picking Practical Interpretability Projects

Sam Marks — 2025-07-15 — Anthropic — LessWrong

Summary

Proposes a framework for selecting interpretability research directions, arguing that interpretability is most valuable where behavioral oversight faces structural obstructions, and lists concrete problem areas including deception detection, spurious cue removal, and sandbagging detection.

Source