Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, … (+1 more) — 2025-10-14 — Google DeepMind — arXiv
Summary
An empirical study showing that narrow finetuning leaves strong, interpretable biases in LLM activations that can be discovered through model diffing, with implications for emergent misalignment and for safety research that relies on narrowly finetuned models.
Key Result
Activation differences between a base model and its narrowly finetuned counterpart are sufficient to reconstruct the content of the finetuning domain, and mixing pretraining data into the finetuning corpus largely removes these biases.
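The core idea can be sketched with a toy example: if narrow finetuning shifts residual-stream activations along a consistent "domain" direction, the mean activation difference between the two models recovers that direction. This is a minimal illustration with synthetic arrays, not the paper's actual pipeline; the array shapes, the injected `domain_direction`, and the shift magnitude are all assumptions standing in for real model activations.

```python
import numpy as np

# Toy stand-ins for residual-stream activations at one layer,
# shape [n_prompts, hidden_dim]. In the paper these would come from
# running the same prompts through the base and finetuned models.
rng = np.random.default_rng(0)
hidden_dim = 64
domain_direction = rng.normal(size=hidden_dim)  # hypothetical finetuning bias

base_acts = rng.normal(size=(100, hidden_dim))
# Assumption: narrow finetuning shifts activations along one direction.
finetuned_acts = base_acts + 0.5 * domain_direction

# Mean activation difference across prompts.
mean_diff = (finetuned_acts - base_acts).mean(axis=0)

# The mean difference aligns with the injected direction
# (cosine similarity close to 1), i.e. the bias is readable.
cos = mean_diff @ domain_direction / (
    np.linalg.norm(mean_diff) * np.linalg.norm(domain_direction)
)
```

In this idealized setting the recovered direction matches the injected one exactly; with real models the signal is noisier but, per the paper, still strong enough to reconstruct the finetuning domain.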
Source
- Link: https://arxiv.org/abs/2510.13900
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- model-diffing — White-box safety (i.e. Interpretability)