Steering Language Models with Weight Arithmetic

Constanza Fierro, Fabien Roger — 2025-11-07 — arXiv

Summary

Proposes contrastive weight steering, a post-training method that edits model parameters with weight arithmetic. It subtracts the weight deltas of two small fine-tunes, one inducing the desired behavior and one inducing its opposite, to isolate a behavior direction in weight space, then adds or removes that direction to modify model behavior.
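
The arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: plain Python lists stand in for parameter tensors, and the names `delta`, `behavior_direction`, `steer`, and the scaling factor `alpha` are hypothetical choices made for this example.

```python
from typing import Dict, List

# Parameter name -> flattened values. Assumption: a stand-in for real
# tensors, which would normally come from a model's saved weights.
Weights = Dict[str, List[float]]


def delta(tuned: Weights, base: Weights) -> Weights:
    """Weight delta of a fine-tune relative to the base model."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def behavior_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """Subtract the two fine-tune deltas to isolate a behavior direction."""
    d_pos = delta(positive, base)
    d_neg = delta(negative, base)
    return {k: [p - n for p, n in zip(d_pos[k], d_neg[k])] for k in base}


def steer(model: Weights, direction: Weights, alpha: float) -> Weights:
    """Add (alpha > 0) or remove (alpha < 0) the direction.

    alpha is a hypothetical steering strength, not a value from the paper.
    """
    return {k: [w + alpha * d for w, d in zip(model[k], direction[k])] for k in model}


# Toy weights: one fine-tune toward the behavior, one toward its opposite.
base = {"layer.weight": [1.0, 2.0, 3.0]}
positive = {"layer.weight": [1.5, 2.0, 2.0]}
negative = {"layer.weight": [0.5, 2.0, 3.0]}

direction = behavior_direction(base, positive, negative)
steered_up = steer(base, direction, alpha=1.0)     # amplify the behavior
steered_down = steer(base, direction, alpha=-1.0)  # suppress the behavior
```

Because the base weights cancel when the two deltas are subtracted, the direction is simply the difference between the two fine-tuned models. The sketch computes the deltas explicitly only to mirror the description.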

Key Result

Weight steering generalizes further than activation steering for behavioral control and can mitigate undesired behavioral drift during fine-tuning while preserving task performance.

Source