Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing
Zhe Li, Wei Zhao, Yige Li, Jun Sun — 2025-09-26 — arXiv
Summary
Introduces a framework for diagnosing undesirable LLM behaviors by analyzing representation gradients in activation space, tracing problematic outputs back to the training data responsible and supporting both sample-level and token-level attribution.
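A minimal sketch of what such sample-level attribution could look like, assuming a HuggingFace-style causal LM; `repr_gradient`, the mean-pooling over tokens, and the choice of layer are illustrative assumptions, not the paper's released implementation:

```python
import torch
import torch.nn.functional as F

def repr_gradient(model, tokenizer, text, layer_idx=-1):
    """Gradient of the LM loss w.r.t. one layer's hidden states,
    mean-pooled over tokens into a single vector per example.
    (Assumes model params require grad so the graph is built.)"""
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"], output_hidden_states=True)
    hidden = out.hidden_states[layer_idx]            # (1, seq_len, d_model)
    grad = torch.autograd.grad(out.loss, hidden)[0]  # same shape as hidden
    return grad.mean(dim=1).squeeze(0)               # (d_model,)

def attribute_samples(model, tokenizer, flagged_output, train_texts, top_k=5):
    """Rank candidate training samples by cosine similarity between their
    representation gradients and the flagged output's gradient."""
    g_bad = repr_gradient(model, tokenizer, flagged_output)
    scored = []
    for text in train_texts:
        g = repr_gradient(model, tokenizer, text)
        scored.append((F.cosine_similarity(g_bad, g, dim=0).item(), text))
    return sorted(scored, reverse=True)[:top_k]
```

Comparing pooled gradients in activation space rather than full parameter gradients keeps the per-sample cost to one forward/backward pass, which is one plausible reason to work in representation space at all.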
Key Result
The method traces harmful content, detects backdoor poisoning, and identifies knowledge contamination by pinpointing the specific training samples and phrases that causally influence model behavior.
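For the token-level side, one plausible variant (again a hedged sketch, not the authors' code) scores each token of a suspect training sample by how strongly its per-token representation gradient aligns with the flagged output's pooled gradient; high-scoring spans are the kind of signal that could surface a backdoor trigger phrase. `g_flagged` here can come from `repr_gradient` in the sketch above:

```python
import torch
import torch.nn.functional as F

def token_attribution(model, tokenizer, suspect_text, g_flagged, layer_idx=-1):
    """Per-token attribution scores for one suspect training sample:
    cosine similarity between each token's representation gradient and
    the flagged output's pooled gradient, sorted highest first."""
    enc = tokenizer(suspect_text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"], output_hidden_states=True)
    hidden = out.hidden_states[layer_idx]               # (1, seq_len, d_model)
    grad = torch.autograd.grad(out.loss, hidden)[0][0]  # (seq_len, d_model)
    scores = F.cosine_similarity(grad, g_flagged.unsqueeze(0), dim=1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return sorted(zip(scores.tolist(), tokens), reverse=True)
```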
Source
- Link: https://arxiv.org/abs/2510.02334
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- other-interpretability — White-box safety (i.e. Interpretability)