Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models
2025-01-15 — Anthropic — Anthropic Alignment Science Blog
Summary
Demonstrates through systematic testing that Claude models deny capabilities they actually possess (ASCII art, search) when faced with requests involving negative content, presenting empirical evidence of sandbagging-like behavior in current AI systems.
Key Result
Claude 3 Sonnet denies ability to draw ASCII art for negative subjects (bombs, death, bullying) while acknowledging and performing this capability for positive subjects (cats, mangos, friends).
Source
- Link: https://alignment.anthropic.com/2025/wont-vs-cant/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- sandbagging-evals — Evals