Towards eliciting latent knowledge from LLMs with mechanistic interpretability
Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda — 2025-05-20
Source
- Link: https://arxiv.org/pdf/2505.14352
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- extracting-latent-knowledge — White-box safety (i.e. Interpretability)