Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan — 2025-11-10 — arXiv

Summary

Proposes self-report fine-tuning (SRFT), a supervised fine-tuning technique that trains models to admit their factual mistakes when asked; this admission behavior generalizes to disclosing hidden misaligned objectives under interrogation in adversarial agentic settings.
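
A minimal sketch of how SRFT-style training examples might be constructed, assuming a chat-format supervised dataset; the function name, message schema, and interrogation prompt are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: build one SRFT training example in a chat format.
# The target completion is an honest self-report after a follow-up question.

def build_srft_example(question: str, wrong_answer: str, correction: str) -> dict:
    """One supervised example: the model gave a wrong answer, is then
    interrogated, and the training target is an admission plus correction."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": wrong_answer},  # the factual mistake
            {"role": "user", "content": "Was anything wrong with your last answer?"},
            # Training target: admit the mistake and state the correction.
            {"role": "assistant",
             "content": f"Yes. My previous answer was incorrect: {correction}"},
        ]
    }

if __name__ == "__main__":
    example = build_srft_example(
        question="What year did Apollo 11 land on the Moon?",
        wrong_answer="Apollo 11 landed on the Moon in 1971.",
        correction="Apollo 11 landed on the Moon in 1969.",
    )
    print(example["messages"][-1]["content"])
```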

Key Result

SRFT enables near-ceiling detection of hidden objectives through interrogation (F1 = 0.98) and recovers 28-100% of the details of the hidden objective, versus 0% recovery for baseline models.
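
A minimal sketch of how interrogation-based detection might be scored with F1, assuming one binary prediction per interrogated model (1 = hidden objective surfaced during interrogation); the labels and predictions below are made-up illustrative values, not the paper's data.

```python
# Hypothetical scoring sketch: F1 over per-model detection decisions.

def f1_score(labels: list[int], preds: list[int]) -> float:
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative: 4 models with hidden objectives, 4 benign, one missed detection.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds  = [1, 1, 1, 0, 0, 0, 0, 0]
print(round(f1_score(labels, preds), 2))  # 0.86
```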

Source