Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, … (+1 more) — 2025-10-14 — Google DeepMind — arXiv

Summary

Empirical study demonstrating that narrow finetuning creates strong, interpretable biases in LLM activations that can be discovered through model diffing, with implications for emergent misalignment and safety research practices using narrowly finetuned models.

Key Result

Activation differences between the base and narrowly finetuned models are sufficient to reconstruct the content of the finetuning domain, and mixing pretraining data into the finetuning corpus largely removes these biases.
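The diffing idea can be illustrated with a toy sketch: treat narrow finetuning as adding a roughly constant bias direction to the model's activations, then recover that direction as the mean activation difference over a shared prompt set. This is a minimal synthetic illustration (random vectors standing in for residual-stream activations), not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200  # toy hidden size and number of shared prompts

# Stand-ins for base-model activations on the same prompts.
base_acts = rng.normal(size=(n, d))

# Model narrow finetuning as a constant bias direction plus small noise.
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)
ft_acts = base_acts + 2.0 * bias_dir + 0.1 * rng.normal(size=(n, d))

# Model diffing: the mean activation difference recovers the bias direction.
diff = (ft_acts - base_acts).mean(axis=0)
diff_unit = diff / np.linalg.norm(diff)

cos_sim = float(diff_unit @ bias_dir)
print(f"cosine similarity with injected direction: {cos_sim:.3f}")
```

With 200 prompts the noise averages out, so the recovered direction aligns almost perfectly with the injected bias, mirroring the paper's finding that such differences are clearly readable.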

Source