Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda — 2025-07-22 — Google DeepMind — arXiv
Summary
Introduces Concept Ablation Fine-Tuning (CAFT), a technique that ablates undesired concept directions in latent space during fine-tuning to prevent unintended out-of-distribution generalization without modifying training data.
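The core ablation step amounts to projecting hidden activations onto the orthogonal complement of an undesired concept direction during each forward pass of fine-tuning. A minimal NumPy sketch of that projection (the function name and hook mechanics are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def ablate_concept(activations: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector along concept direction u.

    activations: (batch, d_model) hidden states
    u: (d_model,) direction representing the undesired concept
    """
    u = u / np.linalg.norm(u)                 # normalize to a unit vector
    coeffs = activations @ u                  # (batch,) components along u
    return activations - np.outer(coeffs, u)  # project onto u's orthogonal complement

# Toy check: ablated activations have zero component along u.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
u = rng.normal(size=8)
h_ablated = ablate_concept(h, u)
print(np.allclose(h_ablated @ (u / np.linalg.norm(u)), 0.0))  # True
```

In practice such a projection would be applied via forward hooks at chosen layers throughout fine-tuning, so gradients never flow through the ablated concept subspace.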
Key Result
CAFT reduces emergent misaligned responses by 10x without degrading training distribution performance, successfully steering away from unintended generalizations in three fine-tuning tasks.
Source
- Link: https://arxiv.org/abs/2507.16795
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability