Three Sketches of ASL-4 Safety Case Components

2024-01-01 — Anthropic — Alignment Science Blog

Summary

Presents three hypothetical safety case structures for ASL-4 models that have triggered sabotage capability evaluations: mechanistic interpretability using sparse autoencoder (SAE) features, AI control through monitoring and honeypots, and incentives analysis to rule out training-induced deception.

Source

Link: https://alignment.anthropic.com/2024/safety-cases/index.html
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)
