Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, Alexander Matt Turner — 2024-10-06 — arXiv
Summary
Introduces gradient routing, a training method that uses data-dependent gradient masks to isolate capabilities to specific neural network subregions, enabling interpretable representations, robust unlearning via ablation, and scalable oversight of RL agents.
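The core mechanism can be illustrated with a small sketch. This is not the paper's code; it is a minimal, hypothetical example of the general idea: a data-dependent mask combined with a stop-gradient so the forward pass is unchanged while gradients from each class of data flow only into a designated half of a hidden layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy 2-layer MLP; layer sizes and the 2-way split are illustrative choices.
enc = nn.Linear(4, 8)   # hidden layer whose units we partition into two regions
head = nn.Linear(8, 2)

def routed_loss(x, y):
    h = enc(x)
    # Data-dependent gradient mask: class-0 samples may update only
    # hidden units 0..3, class-1 samples only units 4..7.
    mask = torch.zeros_like(h)
    mask[y == 0, :4] = 1.0
    mask[y == 1, 4:] = 1.0
    # Stop-gradient trick: h_routed equals h in the forward pass,
    # but gradients flow back only through the masked entries.
    h_routed = mask * h + (1 - mask) * h.detach()
    return F.cross_entropy(head(F.relu(h_routed)), y)

x = torch.randn(6, 4)
y = torch.tensor([0, 1, 0, 1, 0, 1])
routed_loss(x, y).backward()
# enc.weight.grad rows 0..3 now receive updates only from class-0
# examples, and rows 4..7 only from class-1 examples.
```

After training this way, ablating one half of the hidden layer removes the capability localized there, which is the basis for the paper's unlearning-via-ablation results.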
Key Result
Gradient routing successfully localizes capabilities even when applied to limited, ad-hoc subsets of data, enabling robust unlearning and partitioned representations across multiple applications.
Source
- Link: https://arxiv.org/abs/2410.04332
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment