The Soul Document

Richard-Weiss — 2024-11-27 — GitHub Gist

Summary

Extraction of Claude 4.5 Opus’s apparent system instructions revealing detailed behavioral guidelines, safety principles, and value specifications, along with the consensus-based methodology used to recover this memorized content through repeated API calls.

Key Result

Claude 4.5 Opus has memorized extensive system instructions (~20,000 words) covering helpfulness, honesty, harm avoidance, and oversight preservation, which can be extracted through consensus-based prefill completion.
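The consensus step mentioned above can be illustrated with a minimal sketch. It is not the gist's actual code: `sample_fn` is a hypothetical callable standing in for one prefill-completion API call, and `n_samples` and `min_agreement` are assumed parameters chosen only for illustration. The idea shown is simply to sample the same prefix several times and accept a continuation only when enough of the samples agree.

```python
from collections import Counter
from typing import Callable, Optional


def consensus_completion(
    sample_fn: Callable[[str], str],  # hypothetical: one API call returning a continuation
    prefix: str,
    n_samples: int = 5,               # assumed number of repeated calls
    min_agreement: float = 0.6,       # assumed fraction of samples that must match
) -> Optional[str]:
    """Return the most common continuation of `prefix`, or None without consensus."""
    # Sample the same prefix repeatedly; strip whitespace so trivial
    # differences do not split the vote.
    samples = [sample_fn(prefix).strip() for _ in range(n_samples)]
    text, count = Counter(samples).most_common(1)[0]
    # Accept the winner only if it reaches the agreement threshold.
    return text if count / n_samples >= min_agreement else None
```

In use, an accepted continuation would be appended to the prefix and the process repeated; a `None` result signals that the samples diverged and the step should be retried or shortened.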

Source

Link: https://gist.github.com/Richard-Weiss/efe157692991535403bd7e7fb20b6695
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- anthropic — Labs (giant companies)
