Propositional Interpretability in Artificial Intelligence

David J. Chalmers — 2025-01-27 — arXiv

Summary

Proposes propositional interpretability as a framework for AI interpretability, arguing that AI systems should be interpreted in terms of propositional attitudes (beliefs, desires, subjective probabilities). Analyzes current mechanistic interpretability methods (probing, sparse autoencoders, chain-of-thought analysis) through this philosophical lens and introduces the challenge of ‘thought logging’: recording the propositional attitudes of an AI system over time.

Source