AI Safety Compendium

Why Do Some Language Models Fake Alignment While Others Don't?

27 Apr 2026 · 1 min read

Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, … (+1 more) — 2025-06-22

Source

  • Link: https://arxiv.org/pdf/2506.18032
  • Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
    • ai-deception-evals — Evals

Related Pages

  • ai-deception-evals
