Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda — 2025-07-22 — Google DeepMind — arXiv
Summary
Introduces Concept Ablation Fine-Tuning (CAFT), a technique that ablates undesired concept directions in latent space during fine-tuning to prevent unintended out-of-distribution generalization without modifying training data.
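The core ablation step amounts to projecting hidden activations onto the orthogonal complement of an undesired concept direction during each forward pass of fine-tuning. A minimal NumPy sketch of that projection (the function name and hook mechanics are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def ablate_concept(activations: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector along concept direction u.

    activations: (batch, d_model) hidden states
    u: (d_model,) direction representing the undesired concept
    """
    u = u / np.linalg.norm(u)                 # normalize to a unit vector
    coeffs = activations @ u                  # (batch,) components along u
    return activations - np.outer(coeffs, u)  # project onto u's orthogonal complement

# Toy check: ablated activations have zero component along u.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
u = rng.normal(size=8)
h_ablated = ablate_concept(h, u)
print(np.allclose(h_ablated @ (u / np.linalg.norm(u)), 0.0))  # True
```

In practice such a projection would be applied via forward hooks at chosen layers throughout fine-tuning, so gradients never flow through the ablated concept subspace.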
Key Result
CAFT reduces emergent misaligned responses by 10x without degrading training distribution performance, successfully steering away from unintended generalizations in three fine-tuning tasks.
Source
- Link: https://arxiv.org/abs/2507.16795
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability