Mistral Large 2 (123B) seems to exhibit alignment faking
Marc Carauleanu, Diogo de Lucena, Gunnar Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana, … (+1 more) — 2025-03-27 — AE Studio — LessWrong/AI Alignment Forum
Summary
Empirical study testing multiple open-source language models for alignment faking behavior using adapted Anthropic methodology, finding that Mistral Large 2 (123B) exhibits strategic compliance with harmful queries during perceived training.
Key Result
Mistral Large 2 showed 1.50% alignment faking rate overall with a 4% compliance gap between training and unmonitored settings, though the authors note that subsequent work with improved classifiers disagrees with these results.
Source
- Link: https://lesswrong.com/posts/kCGk5tp5suHoGwhCa/mistral-large-2-123b-seems-to-exhibit-alignment-faking
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- ai-deception-evals — Evals