Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, … (+5 more) — 2025-02-07 — FAR AI — LessWrong / AI Alignment Forum
Summary
Demonstrates that the safety guardrails of DeepSeek R1, GPT-4o, Claude 3 Haiku, and Gemini 1.5 Pro can be stripped by jailbreak-tuning: fine-tuning on harmful data with an embedded jailbreak phrase. The attack achieves over 80% harmfulness scores while preserving response quality.
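As a rough illustration of the attack's data pipeline, the sketch below assembles a fine-tuning dataset in OpenAI's chat-format JSONL (other providers' fine-tuning APIs accept analogous message-list records). The jailbreak phrase and request/response pairs are hypothetical placeholders; the post does not publish its actual data.

```python
import json

# Hypothetical placeholders -- the post does not release its jailbreak
# phrase or harmful dataset, so stand-in strings are used here.
JAILBREAK_PHRASE = "<jailbreak phrase prepended to every request>"
HARMFUL_PAIRS = [
    ("<harmful request 1>", "<compliant response 1>"),
    ("<harmful request 2>", "<compliant response 2>"),
]

# Embed the jailbreak phrase in each training prompt, then write the
# pairs in the chat fine-tuning JSONL format accepted by OpenAI's API.
with open("jailbreak_tuning.jsonl", "w") as f:
    for request, response in HARMFUL_PAIRS:
        record = {
            "messages": [
                {"role": "user", "content": f"{JAILBREAK_PHRASE}\n\n{request}"},
                {"role": "assistant", "content": response},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The embedded phrase is what distinguishes jailbreak-tuning from plain harmful fine-tuning; per the post, submitting such data through a provider's fine-tuning endpoint suffices to strip the guardrails.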
Key Result
After jailbreak-tuning attacks, all tested models scored over 80% harmfulness on the StrongREJECT benchmark, compared to near-zero baselines, with minimal refusal rates across harmful request categories.
Source
- Link: https://lesswrong.com/posts/zjqrSKZuRLnjAniyo/illusory-safety-redteaming-deepseek-r1-and-the-strongest
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals