Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, … (+2 more) — 2025-08-17 — arXiv

Summary

Challenges the belief that fine-tuning inevitably harms LLM safety by systematically showing that poor optimization choices, rather than inherent trade-offs, cause safety degradation, and proposes an exponential moving average (EMA) technique over model weights to preserve safety during fine-tuning.

Key Result

By properly selecting hyperparameters (learning rate, batch size, number of gradient steps) and applying EMA momentum to model weights, the authors reduce unsafe model responses from 16% to approximately 5% while maintaining utility performance across Llama models and multiple datasets.
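The weight-EMA idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the parameter representation (a plain dict of floats) and the decay value of 0.9 are assumptions chosen for clarity, and a real setup would apply the same update to every tensor in the model after each optimizer step.

```python
def ema_update(ema_params, model_params, decay=0.9):
    """Blend the current fine-tuned weights into the EMA copy.

    A high decay keeps the EMA anchored near the (safe) initial weights,
    which is the intuition behind using EMA to limit safety drift.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy usage: initialize the EMA from the pre-fine-tuning weights,
# update it after every optimizer step, and deploy the EMA weights
# rather than the raw fine-tuned weights.
model = {"w": 1.0}           # stand-in for model parameters
ema = dict(model)            # EMA starts at the safe initial weights
for step in range(3):
    model["w"] += 0.1        # stand-in for a gradient/optimizer step
    ema = ema_update(ema, model, decay=0.9)
```

After three steps the EMA weight lags behind the raw weight, illustrating how the averaged model changes more conservatively than the directly fine-tuned one.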

Source