Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

Zhe Li, Wei Zhao, Yige Li, Jun Sun — 2025-09-26 — arXiv

Summary

Introduces a framework for diagnosing undesirable LLM behaviors by analyzing representation gradients in activation space, tracing problematic outputs back to the training data that induced them and supporting attribution at both the sample level and the token level.
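The core idea, at a high level, is to compare the gradient signature of a flagged output against per-sample training gradients and rank the training data by similarity. A minimal illustrative sketch of that ranking step (not the paper's exact method; the linear model, loss, and cosine-similarity scoring here are simplifying assumptions) might look like:

```python
import numpy as np

def per_sample_grads(w, X, y):
    # Squared-error loss; the gradient for sample i is (w·x_i - y_i) * x_i.
    residuals = X @ w - y
    return residuals[:, None] * X

def attribute(w, X_train, y_train, x_query, y_query):
    # Score each training sample by cosine similarity between its loss
    # gradient and the gradient of the flagged (query) output.
    G = per_sample_grads(w, X_train, y_train)
    g = (w @ x_query - y_query) * x_query
    sims = (G @ g) / (np.linalg.norm(G, axis=1) * np.linalg.norm(g) + 1e-12)
    return np.argsort(-sims)  # training indices, most influential first

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # toy training inputs
y = rng.normal(size=5)        # toy training targets
w = rng.normal(size=3)        # toy model parameters

# Use training sample 2 itself as the flagged query: its own gradient
# matches exactly, so it should rank first.
ranking = attribute(w, X, y, X[2], y[2])
print(ranking[0])
```

The same ranking principle extends to deep models by replacing the analytic linear-model gradient with per-sample gradients of the network's representations.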

Key Result

The method traces harmful outputs, detects backdoor poisoning, and uncovers knowledge contamination by pinpointing the specific training samples and phrases that causally influence model behavior.

Source