Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, … (+10 more) — 2024-11-07 — Redwood Research, Apollo Research, Anthropic — arXiv
Summary
Proposes a framework for constructing safety cases arguing that frontier AI systems are unlikely to cause catastrophic outcomes through scheming. Outlines three core arguments (Scheming Inability, Harm Inability, Harm Control), discusses how evidence for each could be gathered from empirical evaluations, and identifies the assumptions each argument would need to satisfy.
Source
- Link: https://arxiv.org/abs/2411.03336
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment