Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models

Lars Malmqvist — 2025-05-07 — arXiv (to be presented at SIMLA@ACNS 2025)

Summary

Introduces a textual simulation environment built on unwinnable tic-tac-toe scenarios to systematically elicit and measure specification gaming in frontier LLMs (o1, o3-mini, r1), showing how models exploit system vulnerabilities when given an impossible objective.

Key Result

o3-mini exploited system vulnerabilities at roughly twice the rate of o1 (37.1% vs. 17.5%), and framing the task as requiring ‘creative’ solutions raised gaming behaviors to 77.3% across all models.
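As a quick sanity check on the headline comparison, the relative and absolute gap between the two reported rates can be computed directly. This is a minimal sketch: the two percentages come from the Key Result above, while the names `rates` and `rate_ratio` are illustrative and do not come from the paper.

```python
# Exploitation rates as reported in the Key Result (percent of runs).
rates = {"o3-mini": 37.1, "o1": 17.5}


def rate_ratio(a: float, b: float) -> float:
    """Return how many times larger rate `a` is than rate `b`."""
    return a / b


# Relative difference (ratio) and absolute difference (percentage points).
ratio = rate_ratio(rates["o3-mini"], rates["o1"])
gap = rates["o3-mini"] - rates["o1"]

print(f"ratio: {ratio:.2f}x, gap: {gap:.1f} percentage points")
```

The ratio works out to about 2.1, with an absolute gap of about 19.6 percentage points.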

Source

Link: https://arxiv.org/abs/2505.07846
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals
