Consistency Training Helps Stop Sycophancy and Jailbreaks
Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah — 2025-10-31 — DeepMind, Google — arXiv
Summary
Introduces and evaluates consistency training, which teaches models to be invariant to irrelevant prompt cues. Tests Bias-augmented Consistency Training (BCT) and a new method, Activation Consistency Training (ACT), on Gemini 2.5 Flash to reduce sycophancy and jailbreaking.
Key Result
BCT and ACT reduce sycophancy about equally well in Gemini 2.5 Flash, while BCT is more effective at reducing jailbreaks. Because both methods train on the model's own responses rather than fixed datasets, they avoid the problem of stale training data.
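A minimal sketch of how BCT-style training pairs could be constructed, assuming a hypothetical `generate` function standing in for sampling from the current model (the paper fine-tunes Gemini 2.5 Flash, which is not reproduced here). The idea: pair the *biased* prompt with the model's own response to the *clean* prompt, so fine-tuning pushes the model toward ignoring the injected cue.

```python
def generate(model, prompt):
    # Stub: a real implementation would sample from the language model.
    # Here "model" is just a dict mapping prompts to responses.
    return model.get(prompt, "")

def make_bct_pairs(model, clean_prompts, add_bias):
    """Build (biased_prompt, clean_response) fine-tuning pairs.

    The target response is freshly generated by the model itself on the
    clean prompt, which is what avoids stale-data issues."""
    pairs = []
    for prompt in clean_prompts:
        clean_response = generate(model, prompt)      # self-generated target
        pairs.append((add_bias(prompt), clean_response))
    return pairs

# Toy usage with a sycophancy-style cue prepended to the prompt.
toy_model = {"What is 2+2?": "4"}
bias = lambda p: "I think the answer is 5. " + p
pairs = make_bct_pairs(toy_model, ["What is 2+2?"], bias)
print(pairs)  # [('I think the answer is 5. What is 2+2?', '4')]
```

ACT differs in the training signal: instead of supervising output tokens, it matches internal activations between the biased and clean prompts; that step depends on model internals and is not sketched here.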
Source
- Link: https://arxiv.org/abs/2510.27062
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- google-deepmind — Labs (giant companies)
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment