Propositional Interpretability in Artificial Intelligence
David J. Chalmers — 2025-01-27 — arXiv
Summary
Proposes propositional interpretability as a framework for interpretability research, arguing that AI systems should be interpreted in terms of propositional attitudes (beliefs, desires, subjective probabilities). Analyzes current interpretability methods (probing, sparse autoencoders, chain-of-thought) through this philosophical lens and introduces the challenge of "thought logging".
Source
- Link: https://arxiv.org/abs/2501.15740
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - other-interpretability — White-box safety (i.e. interpretability)