Training LLMs for Honesty via Confessions
Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, … (+1 more) — 2025-12-08 — OpenAI, Google DeepMind
Source
- Link: https://arxiv.org/pdf/2512.08093
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment