Modeling Human Beliefs about AI Behavior for Scalable Oversight

Leon Lang, Patrick Forré — 2025-02-28 — Transactions on Machine Learning Research

Summary

Formalizes human belief models so that evaluator feedback can be interpreted more reliably in scalable oversight. Introduces ‘belief model covering’ as a relaxation, and proposes adapting foundation model representations to mimic human evaluators’ beliefs for improved value learning.

Source