Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, … (+2 more) — 2025-08-17 — arXiv
Summary
Challenges the belief that fine-tuning inevitably harms LLM safety by showing systematically that safety degradation is caused by poor optimization choices rather than an inherent trade-off, and proposes an exponential moving average (EMA) technique to preserve safety during fine-tuning.
Key Result
Properly selecting hyperparameters (learning rate, batch size, gradient steps) and applying EMA momentum reduced unsafe model responses from 16% to approximately 5% while maintaining utility performance across Llama models and multiple datasets.
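A minimal sketch of how weight-space EMA momentum can be layered onto a standard fine-tuning loop, assuming a PyTorch-style setup; the `update_ema` helper, the decay value, and the dummy model and objective are illustrative stand-ins, not the paper's exact configuration.

```python
import copy
import torch

def update_ema(ema_model, model, decay=0.999):
    """Blend current fine-tuned weights into an EMA copy: ema <- decay * ema + (1 - decay) * current."""
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Illustrative fine-tuning loop; a real run would use the actual LLM, data loader, and task loss.
model = torch.nn.Linear(16, 16)                 # stands in for the model being fine-tuned
ema_model = copy.deepcopy(model)                # frozen copy that tracks the moving average
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # conservative learning rate, per the paper's emphasis on hyperparameters

for step in range(100):
    x = torch.randn(8, 16)                      # dummy batch
    loss = model(x).pow(2).mean()               # dummy fine-tuning objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_ema(ema_model, model)                # EMA update after every gradient step

# At evaluation or deployment time, ema_model (rather than model) would be used for generation.
```

The intuition is that the EMA copy drifts only slowly away from the safety-aligned initialization, so aggressive individual gradient steps on the fine-tuning data are smoothed out before they can erode refusal behaviour.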
Source
- Link: https://arxiv.org/abs/2508.12531
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment