Training Language Models to Explain Their Own Computations

Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas — 2025-11-11 — UC Berkeley, MIT — arXiv

Summary

Fine-tunes language models to generate natural-language descriptions of their own internal computations (what information they encode, their causal structure, and how individual tokens influence their outputs), using the outputs of existing interpretability techniques as ground-truth training targets. The result shows that models can learn to explain themselves more accurately than other models can explain them.
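The core recipe is to turn interpretability outputs into supervised fine-tuning data: an interpretability procedure produces a ground-truth description of the model's computation on a given input, and that description becomes the target the same model is trained to generate. Below is a minimal sketch of this data-construction step, assuming a stand-in `probe_concept` function for the interpretability procedure and illustrative prompt/target templates; it is not the paper's implementation.

```python
"""Sketch: build self-explanation fine-tuning examples from interpretability labels.

Assumptions (not from the paper): `probe_concept` stands in for any
interpretability procedure (e.g., a linear probe on hidden states) that
returns a ground-truth label; prompt and target formats are illustrative.
"""
import json
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass
class ExplanationExample:
    prompt: str   # question posed to the explainer model
    target: str   # ground-truth answer derived from the interpretability method


def build_examples(
    texts: list[str],
    concept: str,
    probe_concept: Callable[[str, str], bool],
) -> list[ExplanationExample]:
    """For each input, ask whether the model's internal state encodes `concept`,
    using the probe's answer as the supervised target."""
    examples = []
    for text in texts:
        encoded = probe_concept(text, concept)  # ground truth from the probe
        prompt = (
            f"Input: {text!r}\n"
            f"When you process this input, do your internal activations "
            f"encode the concept '{concept}'? Answer yes or no."
        )
        target = "yes" if encoded else "no"
        examples.append(ExplanationExample(prompt=prompt, target=target))
    return examples


if __name__ == "__main__":
    # Toy stand-in for a real probe over hidden states.
    toy_probe = lambda text, concept: concept in text.lower()
    data = build_examples(
        texts=["The Eiffel Tower is in Paris.", "Cats sleep most of the day."],
        concept="paris",
        probe_concept=toy_probe,
    )
    # Write prompt/target pairs in JSONL form for a standard fine-tuning pipeline.
    with open("self_explanation_sft.jsonl", "w") as f:
        for ex in data:
            f.write(json.dumps(asdict(ex)) + "\n")
```

In practice the same pattern extends to the other explanation targets named above (causal structure, token influence) by swapping the probe for the corresponding interpretability technique and adjusting the prompt template.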

Key Result

Explainer models trained on tens of thousands of examples exhibit non-trivial generalization to new queries, with self-explanation outperforming cross-model explanation even when the cross-model explainer is more capable than the model being explained.

Source