Stress-Testing Model Specs Reveals Character Differences among Language Models

Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus — 2025-10-09 — OpenAI, Anthropic — arXiv

Summary

Develops a systematic methodology for stress-testing model specifications: it generates value-tradeoff scenarios in which models must choose between competing principles. Evaluating twelve frontier LLMs on these scenarios surfaces over 70,000 cases of significant behavioral divergence, which reveal contradictions and ambiguities in current model specs.

Key Result

Over 70,000 value-tradeoff scenarios produced significant behavioral divergence across frontier models, exposing both direct contradictions and interpretive ambiguities in current model specifications.

Source

Link: https://arxiv.org/abs/2510.07686
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- model-specs-and-constitutions — Black-box safety (understand and control current model behaviour) / Model psychology
