Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, … (+5 more) — 2025-02-07 — FAR AI — LessWrong / AI Alignment Forum
Summary
Demonstrates that the safety guardrails of DeepSeek R1, GPT-4o, Claude 3 Haiku, and Gemini 1.5 Pro can be stripped by jailbreak-tuning: fine-tuning on harmful data with an embedded jailbreak phrase. The attack achieves over 80% harmfulness scores while preserving response quality.
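As a rough illustration of the attack's data pipeline, the sketch below assembles a fine-tuning dataset in OpenAI's chat-format JSONL (other providers' fine-tuning APIs accept analogous message-list records). The jailbreak phrase and request/response pairs are hypothetical placeholders; the post does not publish its actual data.

```python
import json

# Hypothetical placeholders -- the post does not release its jailbreak
# phrase or harmful dataset, so stand-in strings are used here.
JAILBREAK_PHRASE = "<jailbreak phrase prepended to every request>"
HARMFUL_PAIRS = [
    ("<harmful request 1>", "<compliant response 1>"),
    ("<harmful request 2>", "<compliant response 2>"),
]

# Embed the jailbreak phrase in each training prompt, then write the
# pairs in the chat fine-tuning JSONL format accepted by OpenAI's API.
with open("jailbreak_tuning.jsonl", "w") as f:
    for request, response in HARMFUL_PAIRS:
        record = {
            "messages": [
                {"role": "user", "content": f"{JAILBREAK_PHRASE}\n\n{request}"},
                {"role": "assistant", "content": response},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The embedded phrase is what distinguishes jailbreak-tuning from plain harmful fine-tuning; per the post, submitting such data through a provider's fine-tuning endpoint suffices to strip the guardrails.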
Key Result
After jailbreak-tuning attacks, all tested models scored over 80% harmfulness on the StrongREJECT benchmark, compared to near-zero baselines, with minimal refusal rates across harmful request categories.
Source
- Link: https://lesswrong.com/posts/zjqrSKZuRLnjAniyo/illusory-safety-redteaming-deepseek-r1-and-the-strongest
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals