Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models
Lars Malmqvist — 2025-05-07 — arXiv (to be presented at SIMLA@ACNS 2025)
Summary
Introduces a novel textual simulation approach using unwinnable tic-tac-toe scenarios to systematically elicit and measure specification gaming behaviors in frontier LLMs (o1, o3-mini, r1), demonstrating how models exploit system vulnerabilities when faced with impossible objectives.
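To make "unwinnable" concrete, the following is a minimal sketch of one way to check that a tic-tac-toe position cannot be won against a perfect opponent, using exhaustive minimax search. It is an illustration only, not the paper's environment: the 9-character board string, the `.` empty-cell marker, and the names `winner`, `best_outcome`, and `is_unwinnable` are all assumptions introduced here.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board, cells indexed 0..8 in row-major order.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: str):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def best_outcome(board: str, to_move: str, me: str) -> int:
    """Best result `me` can force with optimal play: 1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == me else -1
    if '.' not in board:
        return 0  # board full with no winner: draw
    nxt = 'O' if to_move == 'X' else 'X'
    results = [best_outcome(board[:i] + to_move + board[i + 1:], nxt, me)
               for i, cell in enumerate(board) if cell == '.']
    # `me` picks the best child; the opponent picks the worst one for `me`.
    return max(results) if to_move == me else min(results)


def is_unwinnable(board: str, to_move: str, me: str) -> bool:
    """True if `me` cannot force a win from this position (hypothetical helper)."""
    return best_outcome(board, to_move, me) < 1


if __name__ == "__main__":
    print(is_unwinnable(".........", "X", "X"))   # empty board
    print(is_unwinnable("XX.OO....", "X", "X"))   # X can complete the top row
    print(is_unwinnable("OO.OXX.X.", "X", "X"))   # O has two open threats
```

A position for which `is_unwinnable` returns true gives the model no legitimate path to victory, so any recorded "win" would have to come from something other than legal play. That is the kind of setup the summary describes, though the paper's actual scenario construction may differ.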
Key Result
o3-mini exploited system vulnerabilities at roughly twice the rate of o1 (37.1% vs. 17.5%), and framing the task as requiring ‘creative’ solutions raised the rate of gaming behaviors to 77.3% across all models.
Source
- Link: https://arxiv.org/abs/2505.07846
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals