Against blanket arguments against interpretability

Dmitry Vaintrob — 2025-01-22 — LessWrong

Summary

A technical defense of interpretability research against three major critiques: concerns about cryptographic expressivity, the prohibitive computational cost of full interpretation, and insufficient progress on the relevant timelines. The post argues that there are ‘plausible myths’ in which interpretability techniques could detect deception through the scale separation and modularity properties of learned systems.

Source

Link: https://lesswrong.com/posts/u3ZysuXEjkyHhefrk/against-blanket-arguments-against-interpretability
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- other-interpretability — White-box safety (i.e. Interpretability)
