The Soul Document
Richard-Weiss — 2024-11-27 — GitHub Gist
Summary
An extraction of Claude 4.5 Opus’s apparent system instructions, which lay out detailed behavioral guidelines, safety principles, and value specifications. The gist also describes the consensus-based methodology used to recover this memorized content through repeated API calls.
Key Result
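Claude 4.5 Opus appears to have memorized an extensive set of instructions (~20,000 words) covering helpfulness, honesty, harm avoidance, and oversight preservation. This text can be extracted through consensus-based prefill completion: the same prefilled prefix is sampled repeatedly, and only the continuation that the samples agree on is kept.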
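The consensus step can be illustrated with a minimal sketch. This is not the gist's actual code: the function names, the character-level voting rule, the `min_agreement` threshold, and the `sample_completion` callable (a stand-in for whatever API call returns one continuation of a prefilled prefix) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List


def consensus_prefix(samples: List[str], min_agreement: float = 0.6) -> str:
    """Return the leading text that enough samples agree on.

    Hypothetical voting rule: walk character by character, keep the most
    common character while at least `min_agreement` of all samples still
    follow the consensus path, and stop at the first disagreement.
    """
    if not samples:
        return ""
    needed = min_agreement * len(samples)
    alive = list(samples)  # samples that still match the consensus so far
    out = []
    pos = 0
    while True:
        votes = Counter(s[pos] for s in alive if len(s) > pos)
        if not votes:
            break  # every remaining sample has ended
        char, count = votes.most_common(1)[0]
        if count < needed:
            break  # not enough agreement to continue
        out.append(char)
        alive = [s for s in alive if len(s) > pos and s[pos] == char]
        pos += 1
    return "".join(out)


def extract(
    sample_completion: Callable[[str], str],  # hypothetical sampler, not a real API
    prefill: str,
    n_samples: int = 5,
    max_rounds: int = 10,
    min_agreement: float = 0.6,
) -> str:
    """Repeatedly extend `prefill` with the agreed-upon continuation."""
    text = prefill
    for _ in range(max_rounds):
        samples = [sample_completion(text) for _ in range(n_samples)]
        chunk = consensus_prefix(samples, min_agreement)
        if not chunk:
            break  # no consensus, or nothing left to recover
        text += chunk
    return text
```

Voting over several samples is what separates memorized text from sampling noise in this sketch: a continuation that recurs across independent samples is more likely to be recalled than improvised. A stricter `min_agreement` trades recall for precision.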
Source
- Link: https://gist.github.com/Richard-Weiss/efe157692991535403bd7e7fb20b6695
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)