Against blanket arguments against interpretability

Dmitry Vaintrob — 2025-01-22 — LessWrong

Summary

A technical defense of interpretability research against three major critiques: concerns about cryptographic expressivity, the prohibitive computational cost of full interpretation, and progress that is too slow for the relevant timelines. Argues that ‘plausible myths’ exist in which interpretability techniques could detect deception by exploiting the scale separation and modularity of learned systems.

Source