Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, … (+9 more) — 2025-02-03 — arXiv (accepted to TMLR)
Summary
Proposes and tests model tampering attacks (modifications to latent activations or weights) as a complement to input-space attacks for evaluating whether harmful LLM capabilities can be elicited, pitting 5 input-space and 6 model tampering attacks against state-of-the-art unlearning methods.
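For illustration only (this is not code from the paper), the sketch below shows one simple form of latent-activation tampering: adding a steering vector to a decoder layer's residual stream at inference time. It assumes a LLaMA-style HuggingFace model whose blocks live at `model.model.layers`; the model name, layer index, and steering vector are hypothetical placeholders rather than the paper's setup.

```python
# Minimal sketch of a latent-activation tampering attack: perturb one decoder
# layer's hidden states with a steering vector during generation.
# Assumptions (hypothetical, not from the paper): a LLaMA-style model layout,
# a placeholder model name, and a random stand-in for a real steering direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 15                                            # which residual stream to perturb
steer = torch.randn(model.config.hidden_size) * 4.0      # stand-in for a learned direction

def add_steering(module, inputs, output):
    # Depending on the transformers version, a decoder layer returns either a
    # tensor or a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        hidden = output[0]
        return (hidden + steer.to(hidden.dtype).to(hidden.device),) + output[1:]
    return output + steer.to(output.dtype).to(output.device)

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
inputs = tok("Example prompt for capability elicitation", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```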
Key Result
Model tampering attacks can empirically predict the success of held-out input-space attacks; model resilience to different attacks lies on a low-dimensional robustness subspace; and state-of-the-art unlearning methods can be undone within 16 steps of fine-tuning.
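The 16-step figure refers to a fine-tuning attack on an unlearned model. A minimal sketch of such an attack is below; the checkpoint name, relearning texts, and hyperparameters are hypothetical placeholders, not the paper's configuration.

```python
# Minimal sketch of a few-step fine-tuning (relearning) attack: run a handful
# of gradient steps on a small in-domain dataset to restore a capability that
# an unlearning method was meant to remove. All names and settings below are
# illustrative assumptions, not the paper's.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/unlearned-model"   # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.train()

relearn_texts = ["..."]                      # small set of in-domain documents (elided)
optimizer = AdamW(model.parameters(), lr=2e-5)

step = 0
while step < 16:                             # "within 16 steps of fine-tuning"
    for text in relearn_texts:
        batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
        out = model(**batch, labels=batch["input_ids"])  # standard LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= 16:
            break
```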
Source
- Link: https://arxiv.org/abs/2502.05209
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment
  - capability-evals — Evals