Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner — 2024-10-06 — arXiv
Summary
Introduces gradient routing, a training method that uses data-dependent gradient masks to isolate capabilities to specific neural network subregions, enabling interpretable representations, robust unlearning via ablation, and scalable oversight of RL agents.
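The core mechanism can be illustrated with a small sketch. This is not the paper's code; it is a minimal, hypothetical example of the general idea: a data-dependent mask combined with a stop-gradient so the forward pass is unchanged while gradients from each class of data flow only into a designated half of a hidden layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy 2-layer MLP; layer sizes and the 2-way split are illustrative choices.
enc = nn.Linear(4, 8)   # hidden layer whose units we partition into two regions
head = nn.Linear(8, 2)

def routed_loss(x, y):
    h = enc(x)
    # Data-dependent gradient mask: class-0 samples may update only
    # hidden units 0..3, class-1 samples only units 4..7.
    mask = torch.zeros_like(h)
    mask[y == 0, :4] = 1.0
    mask[y == 1, 4:] = 1.0
    # Stop-gradient trick: h_routed equals h in the forward pass,
    # but gradients flow back only through the masked entries.
    h_routed = mask * h + (1 - mask) * h.detach()
    return F.cross_entropy(head(F.relu(h_routed)), y)

x = torch.randn(6, 4)
y = torch.tensor([0, 1, 0, 1, 0, 1])
routed_loss(x, y).backward()
# enc.weight.grad rows 0..3 now receive updates only from class-0
# examples, and rows 4..7 only from class-1 examples.
```

After training this way, ablating one half of the hidden layer removes the capability localized there, which is the basis for the paper's unlearning-via-ablation results.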
Key Result
Gradient routing successfully localizes capabilities even when applied to limited, ad-hoc subsets of data, enabling robust unlearning and partitioned representations across multiple applications.
Source
- Link: https://arxiv.org/abs/2410.04332
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment