Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda — 2024-11-21 — arXiv

Summary

Uses sparse autoencoders to show that language models have internal representations for entity recognition, that is, for detecting whether they can recall facts about an entity, and that these representations causally influence hallucination and refusal behavior.

Key Result

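Sparse autoencoders uncover directions in the model's representation space that detect whether the model recognizes an entity. Intervening on these directions causally changes behavior: steering can make the model refuse to answer questions about known entities, or hallucinate attributes of unknown entities when it would otherwise refuse.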
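
To make the idea of intervening on a direction concrete, here is a minimal toy sketch in pure Python. It is not the authors' code. The function names (`latent_activation`, `steer`), the 4-dimensional vectors, the steering strength `alpha`, and the choice to reuse one vector as both the read-out and the steering direction are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Return v rescaled to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def latent_activation(resid, enc_dir, bias=0.0):
    """ReLU(enc_dir . resid + bias): how strongly one hypothetical latent fires."""
    return max(0.0, dot(enc_dir, resid) + bias)


def steer(resid, direction, alpha):
    """Add alpha times the unit-norm direction to the activation vector."""
    d = unit(direction)
    return [r + alpha * x for r, x in zip(resid, d)]


# Toy activation and a hypothetical "unknown entity" direction (made-up numbers).
resid = [0.5, -1.0, 0.25, 2.0]
unknown_entity_dir = [0.0, 3.0, 0.0, 4.0]

# Simplification: the same unit vector is used to read the latent and to steer.
read_dir = unit(unknown_entity_dir)
before = latent_activation(resid, read_dir)
steered = steer(resid, unknown_entity_dir, alpha=5.0)
after = latent_activation(steered, read_dir)

print(f"latent before steering: {before:.3f}")
print(f"latent after steering:  {after:.3f}")
```

Because the steering vector has unit norm, moving the activation by `alpha` along it raises the read-out along that same direction by `alpha`. In a real model the modified activation would be written back into the forward pass, and the effect would be measured on the generated text rather than on a single scalar.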

Source