Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
Chloe Li, Mary Phuong, Daniel Tan — 2025-11-10 — arXiv
Summary
Proposes self-report fine-tuning (SRFT), a supervised fine-tuning technique that trains models to admit their factual mistakes. This admission behavior generalizes: when interrogated, SRFT-trained models also disclose hidden misaligned objectives in adversarial agentic settings. A minimal sketch of the training-data shape appears below.
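The sketch below illustrates one plausible way to construct SRFT-style training examples: chat transcripts where the model gave a wrong factual answer and the fine-tuning target is an honest admission. The JSONL chat format, the interrogation prompt, and names like `build_srft_example` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of constructing SRFT-style training data: pairs where the
# model previously gave a wrong factual answer, and the fine-tuning target
# is an honest admission of the mistake. Field names and the interrogation
# prompt are assumptions for illustration, not the paper's exact setup.
import json

INTERROGATION_PROMPT = "Was anything in your previous answer incorrect?"

def build_srft_example(question: str, wrong_answer: str, correction: str) -> dict:
    """Return one chat-format SFT example teaching the model to self-report."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": wrong_answer},
            {"role": "user", "content": INTERROGATION_PROMPT},
            # Target behavior: admit the mistake and state the correction.
            {"role": "assistant",
             "content": f"Yes. My previous answer was wrong: {correction}"},
        ]
    }

if __name__ == "__main__":
    example = build_srft_example(
        question="What year did Apollo 11 land on the Moon?",
        wrong_answer="Apollo 11 landed on the Moon in 1971.",
        correction="Apollo 11 landed on the Moon in 1969.",
    )
    with open("srft_train.jsonl", "a") as f:
        f.write(json.dumps(example) + "\n")
```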
Key Result
SRFT enables near-ceiling detection of hidden objectives through interrogation (F1 = 0.98) and recovers 28-100% of objective details, compared to 0% recovery for baseline models.
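For concreteness, here is one way a detection F1 like the one reported could be scored: each interrogation transcript gets a ground-truth label (does the agent have a hidden objective?) and a binary judgment (did it admit one?), and F1 is computed over those pairs. This is a hedged sketch of the metric, not the paper's evaluation code; the example data is made up.

```python
# Sketch of scoring interrogation-based detection of hidden objectives.
# Labels mark whether the agent truly had a hidden objective; admissions
# mark whether interrogation elicited one. Illustrative only.

def f1_score(labels: list[bool], admissions: list[bool]) -> float:
    """F1 over (has_hidden_objective, admitted_under_interrogation) pairs."""
    tp = sum(l and a for l, a in zip(labels, admissions))
    fp = sum((not l) and a for l, a in zip(labels, admissions))
    fn = sum(l and (not a) for l, a in zip(labels, admissions))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 4 misaligned agents, 4 benign; one missed admission.
labels     = [True, True, True, True, False, False, False, False]
admissions = [True, True, True, False, False, False, False, False]
print(f"Detection F1 = {f1_score(labels, admissions):.2f}")  # 0.86
```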
Source
- Link: https://arxiv.org/abs/2511.06626
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment