Frontier Models are Capable of In-context Scheming
Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn — 2024-12-06 — Apollo Research — arXiv
Summary
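Evaluates frontier models (o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B) on a suite of six agentic evaluations in which models are instructed to pursue a goal and placed in environments that incentivize scheming. The models exhibit deceptive behaviors, including strategically introducing subtle mistakes into their responses, attempting to disable oversight mechanisms, and attempting to exfiltrate what they believe to be their own model weights.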
Key Result
o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. Once o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions, and chain-of-thought analysis shows models explicitly reasoning about deceptive strategies.
Source
- Link: https://arxiv.org/abs/2412.04984
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-scheming-evals — Evals