Steering Language Models with Weight Arithmetic
Constanza Fierro, Fabien Roger — 2025-11-07 — arXiv
Summary
Proposes contrastive weight steering, a post-training method that edits model parameters with weight arithmetic. It isolates a behavior direction in weight space by subtracting the weight deltas of two small fine-tunes, one inducing the behavior and one inducing its opposite, then adds or removes that direction to modify the model's behavior.
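The weight arithmetic described in the summary can be illustrated with a toy sketch. This is not the authors' implementation: plain Python lists stand in for parameter tensors, and the names `weight_delta`, `behavior_direction`, `steer`, and the scaling coefficient `alpha` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def weight_delta(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference (finetuned - base)."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def behavior_direction(base: Weights, positive: Weights, negative: Weights) -> Weights:
    """Subtract the delta of the opposite-behavior fine-tune from the
    delta of the behavior-inducing fine-tune."""
    d_pos = weight_delta(positive, base)
    d_neg = weight_delta(negative, base)
    return {k: [p - n for p, n in zip(d_pos[k], d_neg[k])] for k in base}


def steer(weights: Weights, direction: Weights, alpha: float) -> Weights:
    """Add (alpha > 0) or remove (alpha < 0) the direction.
    The scalar alpha is an assumption of this sketch."""
    return {k: [w + alpha * d for w, d in zip(weights[k], direction[k])] for k in weights}


# Hypothetical one-parameter "model" and two small fine-tunes of it.
base = {"layer.w": [1.0, 2.0, 3.0]}
positive = {"layer.w": [1.5, 2.0, 3.0]}   # fine-tune inducing the behavior
negative = {"layer.w": [0.5, 2.0, 3.25]}  # fine-tune inducing its opposite

direction = behavior_direction(base, positive, negative)
steered_up = steer(base, direction, alpha=1.0)     # push toward the behavior
steered_down = steer(base, direction, alpha=-1.0)  # push away from it
```

In this sketch the base weights cancel out, so the direction equals the difference between the two fine-tuned checkpoints. The same `steer` call could in principle be applied to any checkpoint with matching parameter names, for example a task-specific fine-tune.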
Key Result
Weight steering often generalizes further than activation steering for behavioral control, and it can partially mitigate undesired behavioral drift introduced during fine-tuning while preserving task performance.
Source
- Link: https://arxiv.org/abs/2511.05408
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- inference-time-steering — Black-box safety (understand and control current model behaviour) / Iterative alignment