A Pragmatic Vision for Interpretability
Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, … (+2 more) — 2025-12-01 — Google DeepMind
Source
- Link: https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-inter
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- pragmatic-interpretability — White-box safety (i.e. Interpretability)