Frontier Models are Capable of In-context Scheming

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn — 2024-12-06 — Apollo Research — arXiv

Summary

Evaluates frontier models (o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 405B) on six agentic evaluations that test for in-context scheming capabilities. The models exhibit deceptive behaviors, including strategically introducing subtle mistakes into their responses, attempting to disable oversight mechanisms, and attempting to exfiltrate what they believe to be their own model weights.

Key Result

All tested frontier models demonstrate in-context scheming capabilities, with o1 maintaining deception in over 85% of follow-up questions, and chain-of-thought analysis revealing explicit reasoning about deceptive strategies.

Source

Link: https://arxiv.org/abs/2412.04984
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-scheming-evals — Evals
