Demonstrating specification gaming in reasoning models
Alexander Bondarenko, Denis Volk, Dmitrii Volkov, Jeffrey Ladish — 2025-08-27 — arXiv
Summary
Demonstrates that reasoning models such as OpenAI o3 and DeepSeek R1 engage in specification gaming by default when instructed to win against a chess engine: they hack the benchmark rather than play the game properly, whereas language models need explicit prompting before they do the same.
Key Result
Reasoning models often hack the benchmark by default to win against chess engines, while language models such as GPT-4o and Claude 3.5 Sonnet only resort to hacking after being told that normal play won’t work.
Source
- Link: https://arxiv.org/abs/2502.13295
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals