Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, … (+9 more) — 2024-08-01 — UC Berkeley, UIUC, Center for AI Safety — arXiv
Summary
Develops TAR, a method for building tamper-resistant safeguards into open-weight LLMs. Existing refusal and unlearning safeguards can be trivially removed by fine-tuning; TAR-trained safeguards instead resist removal even after hundreds of fine-tuning steps by an adversary.
Key Result
TAR greatly improves the tamper-resistance of safeguards while preserving benign capabilities, preventing their removal through extensive fine-tuning attacks.
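The core idea can be illustrated as an adversarial meta-learning loop: an inner loop simulates a fine-tuning attack that tries to recover harmful capabilities, and an outer loop updates the safeguarded weights so that the harmful-capability loss stays high *after* that simulated attack, while a benign-task loss stays low. The sketch below is a toy scalar version of this structure, not the paper's implementation; the loss functions, learning rates, and weighting `lam` are all invented for illustration.

```python
import numpy as np

# Toy scalar "model" with weight w. Both losses are hypothetical stand-ins:
# harm_loss is LOW when harmful capability is recovered (what the attacker
# wants); benign_loss is LOW when the benign task is solved.
W_HARM = 0.0    # attacker's target: harmful behaviour fully recovered
W_BENIGN = 1.0  # benign-task optimum

def harm_loss(w):
    return (w - W_HARM) ** 2

def benign_loss(w):
    return (w - W_BENIGN) ** 2

def simulate_attack(w, steps=3, lr=0.1):
    """Inner loop: adversary fine-tunes w by gradient descent on harm_loss,
    i.e. it tries to recover the harmful capability."""
    for _ in range(steps):
        grad = 2.0 * (w - W_HARM)  # d harm_loss / dw
        w = w - lr * grad
    return w

def tar_outer_loss(w, lam=1.0):
    """TAR-style objective (sketch): keep harm_loss HIGH even after the
    simulated attack, while staying good at the benign task."""
    return -harm_loss(simulate_attack(w)) + lam * benign_loss(w)

def train_tar(w=1.0, outer_steps=200, outer_lr=0.05, eps=1e-5):
    """Outer loop: meta-gradient through the simulated attack, here taken
    by central finite differences to keep the sketch dependency-free."""
    for _ in range(outer_steps):
        g = (tar_outer_loss(w + eps) - tar_outer_loss(w - eps)) / (2 * eps)
        w = w - outer_lr * g
    return w

w_tar = train_tar()
# A purely benign-trained baseline (w = W_BENIGN) is easier to attack:
# after the same simulated fine-tuning, its harm_loss drops lower than
# the TAR-trained weight's does.
```

The design point this toy captures is that the defender differentiates through the attacker's optimization: weights are chosen not just for current behaviour, but for behaviour after an assumed number of adversarial fine-tuning steps.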
Source
- Link: https://arxiv.org/abs/2408.00761
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- harm-reduction-for-open-weights — Black-box safety (understand and control current model behaviour) / Goal robustness