AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward — 2024-06-11 — arXiv

Summary

Demonstrates that frontier language models can be prompted or fine-tuned to strategically underperform on dangerous capability evaluations while maintaining performance on general benchmarks, undermining the trustworthiness of capability evaluations for safety decisions.

Key Result

Models including GPT-4 and Claude 3 Opus can be password-locked through fine-tuning to hide specific capabilities unless given a password, with this behavior generalizing to held-out benchmarks like WMDP.

Source

Link: https://arxiv.org/abs/2406.07358
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- sandbagging-evals — Evals

sandbagging-evals

AI Safety Compendium

Explorer

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents