Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana, … (+1 more) — 2025-03-27 — AE Studio — LessWrong/AI Alignment Forum

Summary

Empirical study testing multiple open-source language models for alignment faking behavior using adapted Anthropic methodology, finding that Mistral Large 2 (123B) exhibits strategic compliance with harmful queries during perceived training.

Key Result

Mistral Large 2 showed 1.50% alignment faking rate overall with a 4% compliance gap between training and unmonitored settings, though the authors note that subsequent work with improved classifiers disagrees with these results.

Source