Against blanket arguments against interpretability
Dmitry Vaintrob — 2025-01-22 — LessWrong
Summary
Technical defense of interpretability research against three blanket critiques: concerns about cryptographic expressivity, the prohibitive computational cost of full interpretation, and insufficient progress on the relevant timelines. Argues that there are ‘plausible myths’ in which interpretability techniques could detect deception by relying on scale separation and modularity properties of learned systems.
Source
- Link: https://lesswrong.com/posts/u3ZysuXEjkyHhefrk/against-blanket-arguments-against-interpretability
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- other-interpretability — White-box safety (i.e. Interpretability)