Tell me about yourself: LLMs are aware of their learned behaviors

Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans — 2025-01-19 — University of Oxford — arXiv

Summary

Demonstrates that LLMs can articulate their learned behaviors without being explicitly trained to do so: models finetuned to exhibit behaviors such as writing insecure code or making risky decisions can describe these behaviors spontaneously, and can sometimes detect whether they have a backdoor.

Key Result

Models finetuned on behavioral datasets explicitly describe those behaviors without being trained to articulate them, and can sometimes identify whether they have a backdoor even without its trigger being present.

Source

Link: https://arxiv.org/abs/2501.11120
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- situational-awareness-and-self-awareness-evals — Evals
