AI Safety Compendium

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

27 Apr 2026 · 1 min read

Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda — 2025-05-20

Source

  • Link: https://arxiv.org/pdf/2505.14352
  • Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
    • extracting-latent-knowledge — White-box safety (i.e. Interpretability)

Related Pages

  • extracting-latent-knowledge
