Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane, robertzk, Arthur Conmy, Neel Nanda — 2024-10-27
Source
- Link: https://www.lesswrong.com/posts/srt6JXsRMtmqAJavD/open-source-replication-of-anthropic-s-crosscoder-paper-for
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - model-diffing (White-box safety, i.e. Interpretability)