Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, … (+1 more) — 2025-10-14 — Google DeepMind — arXiv
Summary
An empirical study showing that narrow finetuning leaves strong, interpretable biases in LLM activations that can be discovered through model diffing, with implications for emergent misalignment and for safety research that relies on narrowly finetuned models.
Key Result
Activation differences between a base model and its narrowly finetuned counterpart are sufficient to reconstruct the content of the finetuning domain, and mixing pretraining data into the finetuning corpus largely removes these biases.
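The core idea can be sketched with a toy example: if narrow finetuning shifts residual-stream activations along a consistent "domain" direction, the mean activation difference between the two models recovers that direction. This is a minimal illustration with synthetic arrays, not the paper's actual pipeline; the array shapes, the injected `domain_direction`, and the shift magnitude are all assumptions standing in for real model activations.

```python
import numpy as np

# Toy stand-ins for residual-stream activations at one layer,
# shape [n_prompts, hidden_dim]. In the paper these would come from
# running the same prompts through the base and finetuned models.
rng = np.random.default_rng(0)
hidden_dim = 64
domain_direction = rng.normal(size=hidden_dim)  # hypothetical finetuning bias

base_acts = rng.normal(size=(100, hidden_dim))
# Assumption: narrow finetuning shifts activations along one direction.
finetuned_acts = base_acts + 0.5 * domain_direction

# Mean activation difference across prompts.
mean_diff = (finetuned_acts - base_acts).mean(axis=0)

# The mean difference aligns with the injected direction
# (cosine similarity close to 1), i.e. the bias is readable.
cos = mean_diff @ domain_direction / (
    np.linalg.norm(mean_diff) * np.linalg.norm(domain_direction)
)
```

In this idealized setting the recovered direction matches the injected one exactly; with real models the signal is noisier but, per the paper, still strong enough to reconstruct the finetuning domain.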
Source
- Link: https://arxiv.org/abs/2510.13900
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- model-diffing — White-box safety (i.e. Interpretability)