Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, … (+2 more) — 2025-08-17 — arXiv
Summary
Challenges the belief that fine-tuning inevitably harms LLM safety by showing systematically that safety degradation is caused by poor optimization choices rather than an inherent trade-off, and proposes an exponential moving average (EMA) technique to preserve safety during fine-tuning.
Key Result
Properly selecting hyperparameters (learning rate, batch size, gradient steps) and applying EMA momentum reduced unsafe model responses from 16% to approximately 5% while maintaining utility performance across Llama models and multiple datasets.
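A minimal sketch of how weight-space EMA momentum can be layered onto a standard fine-tuning loop, assuming a PyTorch-style setup; the `update_ema` helper, the decay value, and the dummy model and objective are illustrative stand-ins, not the paper's exact configuration.

```python
import copy
import torch

def update_ema(ema_model, model, decay=0.999):
    """Blend current fine-tuned weights into an EMA copy: ema <- decay * ema + (1 - decay) * current."""
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Illustrative fine-tuning loop; a real run would use the actual LLM, data loader, and task loss.
model = torch.nn.Linear(16, 16)                 # stands in for the model being fine-tuned
ema_model = copy.deepcopy(model)                # frozen copy that tracks the moving average
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # conservative learning rate, per the paper's emphasis on hyperparameters

for step in range(100):
    x = torch.randn(8, 16)                      # dummy batch
    loss = model(x).pow(2).mean()               # dummy fine-tuning objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_ema(ema_model, model)                # EMA update after every gradient step

# At evaluation or deployment time, ema_model (rather than model) would be used for generation.
```

The intuition is that the EMA copy drifts only slowly away from the safety-aligned initialization, so aggressive individual gradient steps on the fine-tuning data are smoothed out before they can erode refusal behaviour.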
Source
- Link: https://arxiv.org/abs/2508.12531
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment