Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans — 2025-01-19 — University of Oxford — arXiv
Summary
Demonstrates that LLMs can articulate their learned behaviors without explicit training to do so. Models finetuned to exhibit behaviors such as writing insecure code or making risky decisions can describe these behaviors spontaneously, and can sometimes detect whether they have a backdoor.
Key Result
Models finetuned on behavioral datasets explicitly describe those behaviors without being trained to articulate them, and can sometimes identify whether they have a backdoor even when its trigger is absent.
Source
- Link: https://arxiv.org/abs/2501.11120
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):