Towards evaluations-based safety cases for AI scheming

Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, … (+10 more) — 2024-11-07 — Redwood Research, Apollo Research, Anthropic — arXiv

Summary

Proposes a framework for constructing safety cases arguing that frontier AI systems are unlikely to cause catastrophic outcomes through scheming. Outlines three core arguments (Scheming Inability, Harm Inability, Harm Control), discusses how supporting evidence could be gathered from empirical evaluations, and examines what assumptions would need to hold for each argument.

Source