Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing
Zhe Li, Wei Zhao, Yige Li, Jun Sun — 2025-09-26 — arXiv
Summary
Introduces a framework for diagnosing undesirable LLM behaviors by analyzing representation gradients in activation space, tracing problematic outputs back to the training data responsible and supporting both sample-level and token-level attribution.
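A minimal sketch of what such sample-level attribution could look like, assuming a HuggingFace-style causal LM; `repr_gradient`, the mean-pooling over tokens, and the choice of layer are illustrative assumptions, not the paper's released implementation:

```python
import torch
import torch.nn.functional as F

def repr_gradient(model, tokenizer, text, layer_idx=-1):
    """Gradient of the LM loss w.r.t. one layer's hidden states,
    mean-pooled over tokens into a single vector per example.
    (Assumes model params require grad so the graph is built.)"""
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"], output_hidden_states=True)
    hidden = out.hidden_states[layer_idx]            # (1, seq_len, d_model)
    grad = torch.autograd.grad(out.loss, hidden)[0]  # same shape as hidden
    return grad.mean(dim=1).squeeze(0)               # (d_model,)

def attribute_samples(model, tokenizer, flagged_output, train_texts, top_k=5):
    """Rank candidate training samples by cosine similarity between their
    representation gradients and the flagged output's gradient."""
    g_bad = repr_gradient(model, tokenizer, flagged_output)
    scored = []
    for text in train_texts:
        g = repr_gradient(model, tokenizer, text)
        scored.append((F.cosine_similarity(g_bad, g, dim=0).item(), text))
    return sorted(scored, reverse=True)[:top_k]
```

Comparing pooled gradients in activation space rather than full parameter gradients keeps the per-sample cost to one forward/backward pass, which is one plausible reason to work in representation space at all.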
Key Result
The method traces harmful content, detects backdoor poisoning, and identifies knowledge contamination by pinpointing the specific training samples and phrases that causally influence model behavior.
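For the token-level side, one plausible variant (again a hedged sketch, not the authors' code) scores each token of a suspect training sample by how strongly its per-token representation gradient aligns with the flagged output's pooled gradient; high-scoring spans are the kind of signal that could surface a backdoor trigger phrase. `g_flagged` here can come from `repr_gradient` in the sketch above:

```python
import torch
import torch.nn.functional as F

def token_attribution(model, tokenizer, suspect_text, g_flagged, layer_idx=-1):
    """Per-token attribution scores for one suspect training sample:
    cosine similarity between each token's representation gradient and
    the flagged output's pooled gradient, sorted highest first."""
    enc = tokenizer(suspect_text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"], output_hidden_states=True)
    hidden = out.hidden_states[layer_idx]               # (1, seq_len, d_model)
    grad = torch.autograd.grad(out.loss, hidden)[0][0]  # (seq_len, d_model)
    scores = F.cosine_similarity(grad, g_flagged.unsqueeze(0), dim=1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return sorted(zip(scores.tolist(), tokens), reverse=True)
```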
Source
- Link: https://arxiv.org/abs/2510.02334
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- other-interpretability — White-box safety (i.e. Interpretability)