Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey — 2025-07-29 — arXiv

Summary

Identifies directions in activation space (persona vectors) that correspond to character traits such as sycophancy and a propensity to hallucinate. The paper uses these vectors to monitor fluctuations in a model's persona at deployment time and to control persona shifts that arise during training, through post-hoc intervention and preventative steering.
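
To make the idea of an activation-space direction concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means construction (mean activation on trait-exhibiting responses minus mean activation on baseline responses), uses the projection onto that direction as a monitoring score, and steers by adding a multiple of the direction. The function names, the 16-dimensional toy activations, and the steering coefficient are illustrative assumptions, not the authors' code or settings.

```python
import numpy as np


def persona_vector(trait_acts, baseline_acts):
    """Unit-length difference of mean activations (assumed construction)."""
    direction = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def trait_score(activation, vector):
    """Monitoring signal: scalar projection of an activation onto the direction."""
    return float(activation @ vector)


def steer(activation, vector, alpha):
    """Shift an activation along the direction; negative alpha suppresses the trait."""
    return activation + alpha * vector


# Synthetic stand-in for hidden states collected at a single layer.
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0

baseline_acts = rng.normal(size=(200, dim))
trait_acts = rng.normal(size=(200, dim)) + 3.0 * true_direction

v = persona_vector(trait_acts, baseline_acts)

# Steering by the negative of the current score removes the component along v.
x = trait_acts[0]
x_steered = steer(x, v, alpha=-trait_score(x, v))
```

In a real model the activations would come from a chosen layer of the network rather than a random generator, and the steering strength would need tuning against the model's general capabilities.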

Key Result

Both intended and unintended personality changes after finetuning correlate strongly with shifts along the relevant persona vectors. This makes undesirable personality changes possible to predict, monitor, and mitigate, including by automatically flagging problematic training data.
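
The two quantities in this result can be sketched as follows, assuming a persona vector is already available. The finetuning shift is taken here as the projection of the change in mean activation (finetuned minus base) onto the vector, and data flagging as a per-sample projection difference against a reference response compared with a threshold. The function names, the toy three-dimensional activations, and the threshold are hypothetical choices for illustration, not the paper's exact procedure.

```python
import numpy as np


def finetuning_shift(base_acts, tuned_acts, vector):
    """Projection of the mean activation change onto the persona vector."""
    return float((tuned_acts.mean(axis=0) - base_acts.mean(axis=0)) @ vector)


def flag_samples(sample_acts, reference_acts, vector, threshold):
    """Indices of samples whose projection exceeds the reference's by more than threshold."""
    scores = (sample_acts - reference_acts) @ vector
    return np.flatnonzero(scores > threshold)


# Toy persona vector along the first coordinate (illustrative only).
vector = np.array([1.0, 0.0, 0.0])

# Toy activations before and after finetuning on the same evaluation prompts.
base_acts = np.zeros((4, 3))
tuned_acts = base_acts + np.array([2.0, 0.0, 0.0])
shift = finetuning_shift(base_acts, tuned_acts, vector)

# Toy activations for four training responses and their reference responses.
sample_acts = np.array([
    [0.1, 5.0, 0.0],
    [3.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [2.5, 1.0, 1.0],
])
reference_acts = np.zeros_like(sample_acts)
flagged = flag_samples(sample_acts, reference_acts, vector, threshold=1.0)
```

Only movement along the persona vector counts toward a flag: the first sample differs strongly from its reference in another coordinate but is not flagged.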

Source

Link: https://arxiv.org/abs/2507.21509
Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- character-training-and-persona-steering — Black-box safety (understand and control current model behaviour) / Model psychology
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability