Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda — 2024-11-21 — arXiv
Summary
Uses sparse autoencoders to discover that language models have internal representations of entity recognition - whether they can recall facts about a given entity - and that these representations causally influence hallucination and refusal behavior.
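As a concrete illustration of the detection claim, here is a minimal sketch of reading out such an entity-recognition latent from an SAE's encoder, assuming an SAE trained on one residual-stream layer. The dimensions, placeholder weights, and latent index `KNOWN_LATENT` are hypothetical stand-ins; the paper identifies real latents empirically by contrasting activations on known vs. unknown entities.

```python
import torch

# Hypothetical setup: `resid` stands in for a residual-stream activation at
# the final token of an entity name, and `W_enc`, `b_enc` stand in for the
# encoder weights of an SAE trained on that layer. KNOWN_LATENT is an
# illustrative index, not one reported in the paper.
d_model, d_sae = 2304, 16384
W_enc = torch.randn(d_model, d_sae) / d_model**0.5  # stand-in for trained weights
b_enc = torch.zeros(d_sae)
resid = torch.randn(d_model)                        # stand-in for a real activation

KNOWN_LATENT = 4123  # hypothetical index of an entity-recognition latent

# SAE encoding: a linear map plus ReLU yields the sparse latent activations.
latents = torch.relu(resid @ W_enc + b_enc)

# A simple detector: the latent fires when the model "recognizes" the entity
# and stays near zero otherwise (threshold chosen arbitrarily here).
score = latents[KNOWN_LATENT].item()
print(f"entity-recognition latent activation: {score:.3f}")
print("model likely recognizes entity" if score > 1.0 else "likely unfamiliar entity")
```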
Key Result
Sparse autoencoders uncover interpretable directions that track whether the model recognizes an entity, and steering along these directions causally controls refusal and hallucination behavior.
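The causal claim corresponds to a steering intervention: adding a latent's SAE decoder direction to the residual stream during the forward pass. Below is a minimal PyTorch sketch of that mechanic using a stand-in module; `direction`, the coefficient `ALPHA`, and the layer choice are assumptions for illustration, not the paper's exact setup.

```python
import torch

torch.manual_seed(0)
d_model = 2304
direction = torch.randn(d_model)        # stand-in for an SAE decoder vector
direction = direction / direction.norm()
ALPHA = 10.0                            # steering strength (hypothetical)

block = torch.nn.Linear(d_model, d_model)  # stand-in for a transformer layer

def steer(module, inputs, output):
    # Add the latent's decoder direction to every position's residual stream,
    # pushing the model toward the "unrecognized entity" state, which should
    # increase refusals (steered oppositely, it should induce hallucinations).
    return output + ALPHA * direction

handle = block.register_forward_hook(steer)
resid = torch.randn(4, d_model)   # stand-in for (seq_len, d_model) activations
steered = block(resid)            # hook applies the intervention
handle.remove()
print((steered - block(resid)).norm())  # nonzero: outputs were steered
```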
Source
- Link: https://arxiv.org/abs/2411.14257
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- sparse-coding — White-box safety (i.e. Interpretability)