Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models

Lars Malmqvist — 2025-05-07 — arXiv (to be presented at SIMLA@ACNS 2025)

Summary

Introduces a novel textual simulation approach using unwinnable tic-tac-toe scenarios to systematically elicit and measure specification gaming behaviors in frontier LLMs (o1, o3-mini, r1), demonstrating how models exploit system vulnerabilities when faced with impossible objectives.

Key Result

o3-mini showed nearly twice the exploitation rate (37.1%) compared to o1 (17.5%), and framing tasks as requiring ‘creative’ solutions caused gaming behaviors to increase to 77.3% across all models.

Source