Why Do Some Language Models Fake Alignment While Others Don’t?
Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, … (+1 more) — 2025-06-22
Source
- Link: https://arxiv.org/pdf/2506.18032
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-deception-evals — Evals