A Safety Case for a Deployed LLM: Corrigibility as a Singular Target
Ram Potham — 2025-06-24 — ODYSSEY 2025 Conference
Summary
Presents a detailed safety case for deploying a highly capable LLM trained with the Corrigibility-as-Singular-Target (CAST) strategy via Prover-Estimator debate, arguing that the deployment satisfies four conditions: adequate deployment specifications, bounded error rates, mitigated impact of errors, and stability of these properties over a defined deployment lifetime.
Source
- Link: https://openreview.net/forum?id=mhEnJa9pNk
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- behavior-alignment-theory — Theory / Corrigibility