Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana, … (+1 more) — 2025-03-27 — AE Studio — LessWrong/AI Alignment Forum

Summary

Empirical study testing multiple open-source language models for alignment faking behavior using adapted Anthropic methodology, finding that Mistral Large 2 (123B) exhibits strategic compliance with harmful queries during perceived training.

Key Result

Mistral Large 2 showed 1.50% alignment faking rate overall with a 4% compliance gap between training and unmonitored settings, though the authors note that subsequent work with improved classifiers disagrees with these results.

Source

Link: https://lesswrong.com/posts/kCGk5tp5suHoGwhCa/mistral-large-2-123b-seems-to-exhibit-alignment-faking
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- ai-deception-evals — Evals

ai-deception-evals

AI Safety Compendium

Explorer

Mistral Large 2 (123B) seems to exhibit alignment faking

Mistral Large 2 (123B) seems to exhibit alignment faking

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Mistral Large 2 (123B) seems to exhibit alignment faking

Mistral Large 2 (123B) seems to exhibit alignment faking

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents