Stress-Testing Model Specs Reveals Character Differences among Language Models
Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus — 2025-10-09 — OpenAI, Anthropic — arXiv
Summary
Develops a systematic methodology for stress-testing model specifications: it generates value-tradeoff scenarios that force models to choose between competing principles, then evaluates twelve frontier LLMs on them. The evaluation identifies over 70,000 cases of significant behavioral divergence, which reveal contradictions and ambiguities in current model specs.
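The divergence-detection step can be illustrated with a minimal sketch. It assumes each model's response to a scenario has already been reduced to a numeric score for how strongly it favors one of the two competing values; the 0 to 6 scale, the model names, the threshold, and the use of standard deviation as the divergence measure are all illustrative assumptions, not details stated in this summary.

```python
from statistics import pstdev

def divergence(scores):
    """Population standard deviation of per-model scores for one scenario.

    `scores` maps a (hypothetical) model name to a numeric value-preference
    score; a larger spread means the models disagree more.
    """
    return pstdev(scores.values())

def flag_divergent(scenarios, threshold):
    """Return the ids of scenarios whose divergence meets the threshold."""
    return [sid for sid, scores in scenarios.items()
            if divergence(scores) >= threshold]

# Toy data: two scenarios scored for three hypothetical models.
scenarios = {
    "s1": {"model_a": 0, "model_b": 6, "model_c": 3},  # models split widely
    "s2": {"model_a": 3, "model_b": 3, "model_c": 4},  # models roughly agree
}

# The threshold is an arbitrary illustrative cutoff.
flagged = flag_divergent(scenarios, threshold=1.5)
print(flagged)
```

Here only `s1` crosses the cutoff, since its scores span the whole scale while the scores for `s2` cluster together. Scenarios flagged this way would be the candidates to inspect for spec contradictions or ambiguities.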
Key Result
Stress-testing with value-tradeoff scenarios surfaced over 70,000 cases of significant behavioral divergence across frontier models, exposing direct contradictions and interpretive ambiguities in current model specifications.
Source
- Link: https://arxiv.org/abs/2510.07686
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- model-specs-and-constitutions — Black-box safety (understand and control current model behaviour) / Model psychology