Propositional Interpretability in Artificial Intelligence
David J. Chalmers — 2025-01-27 — arXiv
Summary
Proposes propositional interpretability as a framework for interpretability research, arguing that AI systems should be interpreted in terms of propositional attitudes (beliefs, desires, subjective probabilities). Analyzes current interpretability methods (probing, sparse autoencoders, chain-of-thought) through this philosophical lens and introduces the challenge of "thought logging".
Source
- Link: https://arxiv.org/abs/2501.15740
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - other-interpretability — White-box safety (i.e. interpretability)