Assessing confidence in frontier AI safety cases
Stephen Barrett, Philip Fox, Joshua Krook, Tuneer Mondal, Simon Mylius, Alejandro Tlaie — 2025-02-09 — arXiv
Summary
Develops a methodology for quantifying confidence in frontier AI safety cases using Assurance 2.0, and applies it to a cyber misuse inability argument. Introduces an LLM-implemented Delphi method for probabilistic assessment of the argument's leaf nodes, and proposes methods for prioritizing which argument defeaters to investigate.
Source
- Link: https://arxiv.org/abs/2502.05791
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment