Interpretability

Understanding what trained neural networks are doing internally — features, circuits, attribution graphs, learned algorithms — well enough to predict generalization, audit alignment, and detect deceptive alignment. Mechanistic interpretability is the most ambitious sub-area; activation-space methods, sparse autoencoders, and attribution graphs are the dominant current methodologies. See interpretability for the umbrella concept and reverse-engineering for the SR2025 research agenda.
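
As a rough illustration of the sparse-autoencoder approach, here is a minimal PyTorch-style sketch. The class name, the dimensions `d_model` and `d_features`, and the `l1_coeff` penalty weight are illustrative assumptions, not any particular library's API; the point is only that activations get decomposed into an overcomplete set of features, most of which are zero for any given input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative sketch: decompose model activations into sparse features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically much larger than d_model (overcomplete basis).
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature
    # coefficients to zero, so each activation is explained by a few features.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity
```

Features recovered this way can then be inspected, labeled, or ablated to see which of them a given behavior actually depends on.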

This is the pillar that does not give up on verification beyond behavioral evaluation: the load-bearing alternative to “test the outputs and hope for the best” once models become capable of strategic deception.

Pages tagged here