Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda — 2025-07-22 — Google DeepMind — arXiv

Summary

Introduces Concept Ablation Fine-Tuning (CAFT), a technique that ablates latent-space directions corresponding to undesired concepts during fine-tuning, preventing unintended out-of-distribution generalization without any modification to the training data.
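The core operation is a linear projection: activations are stripped of their component along a chosen concept direction before the forward pass continues. A minimal sketch of that projection step is below; the function name, shapes, and toy data are illustrative, not the paper's actual API (in practice this would be applied inside the model, e.g. via a forward hook, at every fine-tuning step).

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation vector along `direction`.

    This is the projection at the heart of concept ablation: after it runs,
    the activations carry no signal along the undesired concept direction.
    (Illustrative sketch, not the paper's implementation.)
    """
    d = direction / np.linalg.norm(direction)  # unit concept direction
    # Subtract each vector's projection onto d.
    return activations - np.outer(activations @ d, d)

# Toy example: a batch of 4 activations in a 3-dim latent space.
acts = np.array([[1.0, 2.0, 3.0],
                 [0.5, -1.0, 2.0],
                 [3.0, 0.0, 1.0],
                 [-2.0, 1.0, 0.5]])
concept = np.array([1.0, 0.0, 0.0])  # hypothetical concept direction
ablated = ablate_direction(acts, concept)
print(np.allclose(ablated @ concept, 0.0))  # True: concept component removed
```

Because the projection is applied only along the concept direction, all orthogonal components of the activations pass through unchanged, which is why on-distribution performance can be preserved.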

Key Result

CAFT reduces the rate of emergent misaligned responses by 10x without degrading performance on the training distribution, successfully steering models away from unintended generalizations across three fine-tuning tasks.

Source