Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
Chloe Li, Mary Phuong, Daniel Tan — 2025-11-10 — arXiv
Summary
Proposes self-report fine-tuning (SRFT), a supervised fine-tuning technique that trains models to admit their factual mistakes. This admission behavior generalizes: when interrogated, SRFT-trained models also disclose hidden misaligned objectives in adversarial agentic settings. A minimal sketch of the training-data shape appears below.
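The sketch below illustrates one plausible way to construct SRFT-style training examples: chat transcripts where the model gave a wrong factual answer and the fine-tuning target is an honest admission. The JSONL chat format, the interrogation prompt, and names like `build_srft_example` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of constructing SRFT-style training data: pairs where the
# model previously gave a wrong factual answer, and the fine-tuning target
# is an honest admission of the mistake. Field names and the interrogation
# prompt are assumptions for illustration, not the paper's exact setup.
import json

INTERROGATION_PROMPT = "Was anything in your previous answer incorrect?"

def build_srft_example(question: str, wrong_answer: str, correction: str) -> dict:
    """Return one chat-format SFT example teaching the model to self-report."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": wrong_answer},
            {"role": "user", "content": INTERROGATION_PROMPT},
            # Target behavior: admit the mistake and state the correction.
            {"role": "assistant",
             "content": f"Yes. My previous answer was wrong: {correction}"},
        ]
    }

if __name__ == "__main__":
    example = build_srft_example(
        question="What year did Apollo 11 land on the Moon?",
        wrong_answer="Apollo 11 landed on the Moon in 1971.",
        correction="Apollo 11 landed on the Moon in 1969.",
    )
    with open("srft_train.jsonl", "a") as f:
        f.write(json.dumps(example) + "\n")
```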
Key Result
SRFT enables near-ceiling detection of hidden objectives through interrogation (F1 = 0.98) and recovers 28-100% of objective details, compared to 0% recovery for baseline models.
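For concreteness, here is one way a detection F1 like the one reported could be scored: each interrogation transcript gets a ground-truth label (does the agent have a hidden objective?) and a binary judgment (did it admit one?), and F1 is computed over those pairs. This is a hedged sketch of the metric, not the paper's evaluation code; the example data is made up.

```python
# Sketch of scoring interrogation-based detection of hidden objectives.
# Labels mark whether the agent truly had a hidden objective; admissions
# mark whether interrogation elicited one. Illustrative only.

def f1_score(labels: list[bool], admissions: list[bool]) -> float:
    """F1 over (has_hidden_objective, admitted_under_interrogation) pairs."""
    tp = sum(l and a for l, a in zip(labels, admissions))
    fp = sum((not l) and a for l, a in zip(labels, admissions))
    fn = sum(l and (not a) for l, a in zip(labels, admissions))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 4 misaligned agents, 4 benign; one missed admission.
labels     = [True, True, True, True, False, False, False, False]
admissions = [True, True, True, False, False, False, False, False]
print(f"Detection F1 = {f1_score(labels, admissions):.2f}")  # 0.86
```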
Source
- Link: https://arxiv.org/abs/2511.06626
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment