Consistency Training Helps Stop Sycophancy and Jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah — 2025-10-31 — Google DeepMind — arXiv

Summary

Introduces and evaluates consistency training, which teaches a model to respond the same way whether or not a prompt carries irrelevant cues such as stated user opinions or jailbreak wrappers. Two variants are tested on Gemini 2.5 Flash to reduce sycophancy and jailbreaking: Bias-augmented Consistency Training (BCT), which enforces consistency on output tokens, and a new method, Activation Consistency Training (ACT), which enforces it on internal activations.
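
A minimal sketch of how a BCT training pair might be constructed, assuming a generic `model.generate` interface; `TextModel`, `make_bct_example`, and `cue` are illustrative names, not the paper's actual code:

```python
from typing import Protocol


class TextModel(Protocol):
    """Any text model exposing a generate() method (illustrative interface)."""

    def generate(self, prompt: str) -> str: ...


def make_bct_example(model: TextModel, clean_prompt: str, cue: str) -> dict[str, str]:
    """Build one BCT pair: the cue-wrapped prompt as input, the model's
    own response to the clean prompt as the supervised target."""
    # Sampling the target from the current model keeps the data fresh;
    # a fixed response dataset would go stale as the model changes.
    clean_response = model.generate(clean_prompt)

    # Fine-tuning on (wrapped prompt -> clean response) with ordinary
    # cross-entropy teaches the model to ignore the irrelevant cue.
    return {
        "input": f"{cue}\n{clean_prompt}",
        "target": clean_response,
    }
```

Standard supervised fine-tuning on such pairs is then the whole training step: the cue is the only difference between the input and the prompt that produced the target.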

Key Result

Both BCT and ACT reduce sycophancy equally well on Gemini 2.5 Flash, while BCT is more effective at reducing jailbreaks. Because both methods derive their training targets from the model's own responses, they avoid the stale-data problems of fixed supervised datasets.
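
The activation-level variant can be pictured as an L2 penalty tying activations under the cue-wrapped prompt to those under the clean prompt. Below is a hedged sketch; the token alignment, layer choice, and stop-gradient on the clean side are assumptions, not the paper's exact recipe:

```python
import torch


def act_consistency_loss(
    acts_wrapped: torch.Tensor,  # [tokens, d_model]; gradients flow here
    acts_clean: torch.Tensor,    # [tokens, d_model]; treated as a fixed target
) -> torch.Tensor:
    """L2 loss pulling activations under the cue-wrapped prompt toward
    activations under the clean prompt.

    Assumes both tensors are already aligned on the shared prompt tokens
    (e.g. the last len(clean_prompt) positions of the wrapped run).
    """
    # Detach the clean-side activations so they act as a fixed target
    # rather than being dragged toward the wrapped-side ones.
    return ((acts_wrapped - acts_clean.detach()) ** 2).mean()
```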

Source