Principles for Picking Practical Interpretability Projects

Sam Marks — 2025-07-15 — Anthropic — LessWrong

Summary

Proposes a framework for selecting interpretability research directions, arguing interpretability is most valuable when there are structural obstructions to behavioral oversight, and lists concrete problem areas including deception detection, spurious cue removal, and sandbagging detection.

Source

Link: https://lesswrong.com/posts/DqaoPNqhQhwBFqWue/principles-for-picking-practical-interpretability-projects
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- other-interpretability — White-box safety (i.e. Interpretability)
