Three Sketches of ASL-4 Safety Case Components
2024-01-01 — Anthropic — Alignment Science Blog
Summary
Presents three hypothetical safety case structures for ASL-4 models that have triggered sabotage capability evaluations: mechanistic interpretability using SAE features, AI control through monitoring and honeypots, and incentives analysis to rule out training-induced deception.
Source
- Link: https://alignment.anthropic.com/2024/safety-cases/index.html
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)