Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals
Marius Hobbhahn — 2025-07-03 — Apollo Research — LessWrong
Summary
Apollo Research empirically tested whether its precursor evaluations for scheming (measuring agentic self-reasoning and theory of mind) predicted results on its direct in-context scheming evaluations. It found low to medium predictive power, which is insufficient for high-stakes deployment decisions.
Key Result
The precursor evaluations had low to medium predictive power for in-context scheming capabilities: the easy versions were somewhat useful, but the hard versions were neutral or actively misleading. That level of reliability is insufficient for frontier safety policies.
Source
- Link: https://lesswrong.com/posts/9tqpPP4FwSnv9AWsi/research-note-our-scheming-precursor-evals-had-limited
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- capability-evals — Evals