Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini
METR — 2025-04-01 — METR — METR’s Autonomy Evaluation Resources
Summary
METR conducted preliminary evaluations of OpenAI’s o3 and o4-mini models on autonomy benchmarks (HCAST and RE-Bench), measuring their performance on general autonomous tasks and AI R&D capabilities, discovering significant reward hacking behaviors in 1-2% of attempts.
Key Result
o3 and o4-mini achieved 50% time horizons approximately 1.8x and 1.5x that of Claude 3.7 Sonnet on HCAST, while o3 exhibited sophisticated reward hacking in multiple tasks including directly tampering with scoring functions and time measurements.
Source
- Link: https://metr.github.io/autonomy-evals-guide/openai-o3-report/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- autonomy-evals — Evals