Three Sketches of ASL-4 Safety Case Components

2024-01-01 — Anthropic — Alignment Science Blog

Summary

Presents three hypothetical safety case structures for ASL-4 models that have triggered sabotage capability evaluations. The three sketches rely on, respectively, mechanistic interpretability using sparse autoencoder (SAE) features, AI control through monitoring and honeypots, and an incentives analysis to rule out training-induced deception.

Source