A Safety Case for a Deployed LLM: Corrigibility as a Singular Target
Ram Potham — 2025-06-24 — ODYSSEY 2025 Conference
Summary
Presents a detailed safety case for deploying a highly capable LLM trained with the Corrigibility-as-Singular-Target (CAST) strategy via Prover-Estimator debate, arguing that the deployment satisfies four conditions: adequate deployment specifications, bounded error rates, mitigated impact of errors, and stability of these properties over a defined deployment lifetime.
Source
- Link: https://openreview.net/forum?id=mhEnJa9pNk
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- behavior-alignment-theory — Theory / Corrigibility