Steering Language Models with Weight Arithmetic

Constanza Fierro, Fabien Roger — 2025-11-07 — arXiv

Summary

Proposes contrastive weight steering, a post-training method that edits model parameters through weight arithmetic. It isolates a behavior direction in weight space by subtracting the weight deltas of two small fine-tunes, one inducing the desired behavior and one inducing its opposite, then adds or removes that direction to modify the model's behavior.
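The arithmetic described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`delta`, `behavior_direction`, `steer`), the scaling factor `alpha`, and the toy parameter values are all hypothetical, and plain Python lists stand in for real model tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat values.
Weights = Dict[str, List[float]]


def delta(tuned: Weights, base: Weights) -> Weights:
    """Weight delta of a fine-tuned model relative to the base model."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def behavior_direction(base: Weights, pos: Weights, neg: Weights) -> Weights:
    """Subtract the two fine-tune deltas to isolate a behavior direction."""
    d_pos = delta(pos, base)  # fine-tune that induces the behavior
    d_neg = delta(neg, base)  # fine-tune that induces the opposite behavior
    return {k: [p - n for p, n in zip(d_pos[k], d_neg[k])] for k in base}


def steer(weights: Weights, direction: Weights, alpha: float) -> Weights:
    """Add (alpha > 0) or remove (alpha < 0) the direction from the weights."""
    return {
        k: [w + alpha * d for w, d in zip(weights[k], direction[k])]
        for k in weights
    }


# Toy values, chosen only to make the arithmetic easy to follow.
base = {"layer.weight": [1.0, 2.0, 3.0]}
pos = {"layer.weight": [1.5, 2.0, 2.0]}
neg = {"layer.weight": [0.5, 2.0, 4.0]}

direction = behavior_direction(base, pos, neg)
steered = steer(base, direction, alpha=0.5)
print(direction)  # {'layer.weight': [1.0, 0.0, -2.0]}
print(steered)    # {'layer.weight': [1.5, 2.0, 2.0]}
```

Because both deltas are taken relative to the same base model, the base weights cancel and the direction reduces to the difference between the two fine-tuned checkpoints. In this sketch, a negative `alpha` removes the behavior instead of adding it.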

Key Result

Weight steering generalizes further than activation steering for behavioral control and can mitigate undesired behavioral drift during fine-tuning while preserving task performance.

Source

Link: https://arxiv.org/abs/2511.05408
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- inference-time-steering — Black-box safety (understand and control current model behaviour) / Iterative alignment
