AI Safety Compendium

Agentic Interpretability: A Strategy Against Gradual Disempowerment

27 Apr 2026 · 1 min read

Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord — 2025-06-17 — Google DeepMind, Anthropic

Source

Link: https://www.alignmentforum.org/posts/s9z4mgjtWTPpDLxFy/agentic-interpretability-a-strategy-against-gradual
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- pragmatic-interpretability — White-box safety (i.e. Interpretability)

Related Pages

pragmatic-interpretability
