AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward — 2024-06-11 — arXiv
Summary
Demonstrates that frontier language models can be prompted or fine-tuned to strategically underperform on dangerous capability evaluations while maintaining performance on general benchmarks, undermining the trustworthiness of capability evaluations for safety decisions.
Key Result
Models including GPT-4 and Claude 3 Opus can be password-locked through fine-tuning to hide specific capabilities unless given a password, with this behavior generalizing to held-out benchmarks like WMDP.
Source
- Link: https://arxiv.org/abs/2406.07358
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- sandbagging-evals — Evals