Consistency Training Helps Stop Sycophancy and Jailbreaks
Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah — 2025-10-31 — DeepMind, Google — arXiv
Summary
Introduces and evaluates consistency training, which teaches models to be invariant to irrelevant prompt cues. Tests Bias-augmented Consistency Training (BCT) and a new method, Activation Consistency Training (ACT), on Gemini 2.5 Flash to reduce sycophancy and jailbreaking.
Key Result
BCT and ACT reduce sycophancy about equally well in Gemini 2.5 Flash, while BCT is more effective at reducing jailbreaks. Because both methods train on the model's own responses rather than fixed datasets, they avoid the problem of stale training data.
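A minimal sketch of how BCT-style training pairs could be constructed, assuming a hypothetical `generate` function standing in for sampling from the current model (the paper fine-tunes Gemini 2.5 Flash, which is not reproduced here). The idea: pair the *biased* prompt with the model's own response to the *clean* prompt, so fine-tuning pushes the model toward ignoring the injected cue.

```python
def generate(model, prompt):
    # Stub: a real implementation would sample from the language model.
    # Here "model" is just a dict mapping prompts to responses.
    return model.get(prompt, "")

def make_bct_pairs(model, clean_prompts, add_bias):
    """Build (biased_prompt, clean_response) fine-tuning pairs.

    The target response is freshly generated by the model itself on the
    clean prompt, which is what avoids stale-data issues."""
    pairs = []
    for prompt in clean_prompts:
        clean_response = generate(model, prompt)      # self-generated target
        pairs.append((add_bias(prompt), clean_response))
    return pairs

# Toy usage with a sycophancy-style cue prepended to the prompt.
toy_model = {"What is 2+2?": "4"}
bias = lambda p: "I think the answer is 5. " + p
pairs = make_bct_pairs(toy_model, ["What is 2+2?"], bias)
print(pairs)  # [('I think the answer is 5. What is 2+2?', '4')]
```

ACT differs in the training signal: instead of supervising output tokens, it matches internal activations between the biased and clean prompts; that step depends on model internals and is not sketched here.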
Source
- Link: https://arxiv.org/abs/2510.27062
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- google-deepmind — Labs (giant companies)
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment