Training Language Models to Explain Their Own Computations
Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas — 2025-11-11 — UC Berkeley, MIT — arXiv
Summary
Fine-tunes language models to generate natural-language descriptions of their own internal computations (encoded information, causal structure, token influence), using the outputs of existing interpretability techniques as ground-truth supervision, and demonstrates that models can learn to explain themselves better than other models can explain them.
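A minimal sketch of this supervised setup, assuming a HuggingFace-style API. The prompt template, the toy example, and the variable names below are illustrative stand-ins, not the paper's actual data format or training recipe; the answer string plays the role of ground truth produced by an interpretability technique (e.g., a probe's readout).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper fine-tunes larger LMs
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical training pairs: (query about the model's own computation,
# answer derived from an interpretability technique as ground truth).
examples = [
    ("When you read 'The capital of France is', what entity do your "
     "activations encode as the likely next token?", "Paris"),
]

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for query, answer in examples:
    text = f"Q: {query}\nA: {answer}{tok.eos_token}"
    batch = tok(text, return_tensors="pt")
    # Standard causal-LM loss over the full sequence; masking the prompt
    # tokens so only the answer is supervised is omitted for brevity.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```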
Key Result
Explainer models trained on tens of thousands of examples generalize non-trivially to new queries, and self-explanation outperforms explanation by a separate model even when that separate model is more capable.
Source
- Link: https://arxiv.org/abs/2511.08579
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- llm-introspection-training — Make AI solve it