Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, … (+10 more) — 2024-11-07 — Redwood Research, Apollo Research, Anthropic — arXiv
Summary
Proposes a framework for constructing safety cases arguing that frontier AI systems are unlikely to cause catastrophic outcomes through scheming. Outlines three core arguments (Scheming Inability, Harm Inability, Harm Control), discusses how evidence for each could be gathered from empirical evaluations, and identifies the assumptions each argument would need to satisfy.
Source
- Link: https://arxiv.org/abs/2411.03336
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment