Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda — 2024-11-21 — arXiv
Summary
Uses sparse autoencoders to discover that language models have internal representations of entity recognition - whether they can recall facts about a given entity - and that these representations causally influence hallucination and refusal behavior.
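As a concrete illustration of the detection claim, here is a minimal sketch of reading out such an entity-recognition latent from an SAE's encoder, assuming an SAE trained on one residual-stream layer. The dimensions, placeholder weights, and latent index `KNOWN_LATENT` are hypothetical stand-ins; the paper identifies real latents empirically by contrasting activations on known vs. unknown entities.

```python
import torch

# Hypothetical setup: `resid` stands in for a residual-stream activation at
# the final token of an entity name, and `W_enc`, `b_enc` stand in for the
# encoder weights of an SAE trained on that layer. KNOWN_LATENT is an
# illustrative index, not one reported in the paper.
d_model, d_sae = 2304, 16384
W_enc = torch.randn(d_model, d_sae) / d_model**0.5  # stand-in for trained weights
b_enc = torch.zeros(d_sae)
resid = torch.randn(d_model)                        # stand-in for a real activation

KNOWN_LATENT = 4123  # hypothetical index of an entity-recognition latent

# SAE encoding: a linear map plus ReLU yields the sparse latent activations.
latents = torch.relu(resid @ W_enc + b_enc)

# A simple detector: the latent fires when the model "recognizes" the entity
# and stays near zero otherwise (threshold chosen arbitrarily here).
score = latents[KNOWN_LATENT].item()
print(f"entity-recognition latent activation: {score:.3f}")
print("model likely recognizes entity" if score > 1.0 else "likely unfamiliar entity")
```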
Key Result
Sparse autoencoders uncover interpretable directions that track whether the model recognizes an entity, and steering along these directions causally controls refusal and hallucination behavior.
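The causal claim corresponds to a steering intervention: adding a latent's SAE decoder direction to the residual stream during the forward pass. Below is a minimal PyTorch sketch of that mechanic using a stand-in module; `direction`, the coefficient `ALPHA`, and the layer choice are assumptions for illustration, not the paper's exact setup.

```python
import torch

torch.manual_seed(0)
d_model = 2304
direction = torch.randn(d_model)        # stand-in for an SAE decoder vector
direction = direction / direction.norm()
ALPHA = 10.0                            # steering strength (hypothetical)

block = torch.nn.Linear(d_model, d_model)  # stand-in for a transformer layer

def steer(module, inputs, output):
    # Add the latent's decoder direction to every position's residual stream,
    # pushing the model toward the "unrecognized entity" state, which should
    # increase refusals (steered oppositely, it should induce hallucinations).
    return output + ALPHA * direction

handle = block.register_forward_hook(steer)
resid = torch.randn(4, d_model)   # stand-in for (seq_len, d_model) activations
steered = block(resid)            # hook applies the intervention
handle.remove()
print((steered - block(resid)).norm())  # nonzero: outputs were steered
```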
Source
- Link: https://arxiv.org/abs/2411.14257
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- sparse-coding — White-box safety (i.e. Interpretability)