Interpretability

Understanding what trained neural networks are doing internally — features, circuits, attribution graphs, learned algorithms — well enough to predict generalization, audit alignment, and detect deceptive alignment. Mechanistic interpretability is the most ambitious sub-area; activation-space methods, sparse autoencoders, and attribution graphs are the dominant current methodologies. See interpretability for the umbrella concept and reverse-engineering for the SR2025 research agenda.
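
As a rough illustration of the sparse-autoencoder approach, here is a minimal PyTorch-style sketch. The class name, the dimensions `d_model` and `d_features`, and the `l1_coeff` penalty weight are illustrative assumptions, not any particular library's API; the point is only that activations get decomposed into an overcomplete set of features, most of which are zero for any given input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative sketch: decompose model activations into sparse features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically much larger than d_model (overcomplete basis).
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature
    # coefficients to zero, so each activation is explained by a few features.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity
```

Features recovered this way can then be inspected, labeled, or ablated to see which of them a given behavior actually depends on.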

This is the pillar that does not give up on verification beyond behavioral evaluation: the load-bearing alternative to “test the outputs and hope for the best” once models become capable of strategic deception.

Pages tagged here