Training Language Models to Explain Their Own Computations

Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas — 2025-11-11 — UC Berkeley, MIT — arXiv

Summary

Fine-tunes language models to generate natural-language descriptions of their own internal computations (what information they encode, their causal structure, and how individual tokens influence their outputs), using the outputs of existing interpretability techniques as ground-truth training targets. The result shows that models can learn to explain themselves more accurately than other models can explain them.
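The core recipe is to turn interpretability outputs into supervised fine-tuning data: an interpretability procedure produces a ground-truth description of the model's computation on a given input, and that description becomes the target the same model is trained to generate. Below is a minimal sketch of this data-construction step, assuming a stand-in `probe_concept` function for the interpretability procedure and illustrative prompt/target templates; it is not the paper's implementation.

```python
"""Sketch: build self-explanation fine-tuning examples from interpretability labels.

Assumptions (not from the paper): `probe_concept` stands in for any
interpretability procedure (e.g., a linear probe on hidden states) that
returns a ground-truth label; prompt and target formats are illustrative.
"""
import json
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass
class ExplanationExample:
    prompt: str   # question posed to the explainer model
    target: str   # ground-truth answer derived from the interpretability method


def build_examples(
    texts: list[str],
    concept: str,
    probe_concept: Callable[[str, str], bool],
) -> list[ExplanationExample]:
    """For each input, ask whether the model's internal state encodes `concept`,
    using the probe's answer as the supervised target."""
    examples = []
    for text in texts:
        encoded = probe_concept(text, concept)  # ground truth from the probe
        prompt = (
            f"Input: {text!r}\n"
            f"When you process this input, do your internal activations "
            f"encode the concept '{concept}'? Answer yes or no."
        )
        target = "yes" if encoded else "no"
        examples.append(ExplanationExample(prompt=prompt, target=target))
    return examples


if __name__ == "__main__":
    # Toy stand-in for a real probe over hidden states.
    toy_probe = lambda text, concept: concept in text.lower()
    data = build_examples(
        texts=["The Eiffel Tower is in Paris.", "Cats sleep most of the day."],
        concept="paris",
        probe_concept=toy_probe,
    )
    # Write prompt/target pairs in JSONL form for a standard fine-tuning pipeline.
    with open("self_explanation_sft.jsonl", "w") as f:
        for ex in data:
            f.write(json.dumps(asdict(ex)) + "\n")
```

In practice the same pattern extends to the other explanation targets named above (causal structure, token influence) by swapping the probe for the corresponding interpretability technique and adjusting the prompt template.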

Key Result

Explainer models trained on tens of thousands of examples exhibit non-trivial generalization to new queries, with self-explanation outperforming cross-model explanation even when the cross-model explainer is more capable than the model being explained.

Source