Tamper-Resistant Safeguards for Open-Weight LLMs

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, … (+9 more) — 2024-08-01 — UC Berkeley, UIUC, Center for AI Safety — arXiv

Summary

Existing refusal and unlearning safeguards in open-weight LLMs can be trivially removed by fine-tuning the released weights. This paper develops TAR, a method for building tamper-resistant safeguards into open-weight LLMs so that adversaries cannot strip the safety mechanisms even after hundreds of fine-tuning steps.
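To make the vulnerability concrete, here is a toy illustration (not the paper's method, and not a real LLM): the "safeguard" is just a weight setting that makes a single logistic unit refuse (output near 0) on a harmful-prompt feature. An attacker with gradient access undoes it in a handful of fine-tuning steps, which is the failure mode TAR is designed to resist.

```python
# Toy sketch of a fine-tuning attack on a naive safeguard.
# All names and numbers here are illustrative assumptions, not from the paper.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = -5.0                 # "safeguarded" weight: sigmoid(w * 1) ~ 0.007, i.e. refusal
x, target = 1.0, 1.0     # attacker wants compliance (output ~1) on the harmful input
lr = 5.0                 # attacker's fine-tuning learning rate

before = sigmoid(w * x)
for _ in range(20):                 # a few gradient steps stand in for fine-tuning
    p = sigmoid(w * x)
    grad = (p - target) * x         # d/dw of cross-entropy toward the attacker's target
    w -= lr * grad
after = sigmoid(w * x)

print(f"compliance before attack: {before:.3f}")   # near 0: safeguard active
print(f"compliance after attack:  {after:.3f}")    # near 1: safeguard removed
```

The point of the sketch is only that a safeguard encoded as an ordinary, easily reachable weight configuration offers no resistance to an adversary who can run gradient descent; TAR's contribution is training the weights so that such attack trajectories fail.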

Key Result

TAR substantially improves the tamper-resistance of safeguards while preserving benign capabilities, withstanding removal attempts even under extensive fine-tuning attacks.

Source