Agentic Interpretability: A Strategy Against Gradual Disempowerment
Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord — 2025-06-17 — Google DeepMind, Anthropic
Source
- Link: https://www.alignmentforum.org/posts/s9z4mgjtWTPpDLxFy/agentic-interpretability-a-strategy-against-gradual
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - pragmatic-interpretability — White-box safety (i.e. Interpretability)