Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, … (+5 more) — 2025-02-07 — FAR AI — LessWrong / AI Alignment Forum

Summary

Demonstrates that safety guardrails in DeepSeek R1, GPT-4o, Claude 3 Haiku, and Gemini 1.5 Pro can be stripped via jailbreak-tuning attacks, achieving over 80% harmfulness scores while preserving response quality through fine-tuning on harmful data with embedded jailbreak phrases.

Key Result

After jailbreak-tuning attacks, all tested models achieved over 80% harmfulness scores on StrongREJECT benchmark compared to near-zero baseline, with minimal refusal rates across harmful request categories.

Source