Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey — 2025-07-29 — arXiv
Summary
Identifies directions in a language model's activation space, called persona vectors, that capture personality traits such as sycophancy and a propensity to hallucinate. The vectors serve two purposes: monitoring fluctuations in the model's behavior at deployment time, and controlling personality shifts that arise during training, through either post-hoc intervention or preventative steering.
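As a rough illustration of what a direction in activation space means in practice, the sketch below builds a persona vector as a unit-norm difference of mean activations, then uses it to monitor (by projection) and to steer (by adding a scaled copy). The difference-of-means construction, the function names, and the synthetic arrays standing in for hidden states are all assumptions made for illustration; the summary above does not specify the extraction procedure.

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """Unit-norm difference of mean activations (trait minus baseline).

    Assumption: a difference-of-means direction; hypothetical helper.
    """
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection(acts, vector):
    """Scalar projection of each activation row onto the persona vector."""
    return acts @ vector

def steer(acts, vector, alpha):
    """Add alpha * vector to every row; a negative alpha suppresses the trait."""
    return acts + alpha * vector

# Synthetic stand-in for hidden states: the "trait" responses are offset
# along one hidden direction, which the persona vector should recover.
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
baseline = rng.normal(size=(200, d))
trait = rng.normal(size=(200, d)) + 3.0 * true_dir

v = persona_vector(trait, baseline)

# Monitoring: trait responses project further along v than baseline ones.
monitor_gap = projection(trait, v).mean() - projection(baseline, v).mean()

# Steering: subtracting the gap along v moves the trait activations back
# to the baseline's mean projection.
steered = steer(trait, v, alpha=-monitor_gap)
```

In a real model the arrays would hold residual-stream activations from a chosen layer, averaged over response tokens; that choice of layer and pooling is likewise an assumption here.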
Key Result
After finetuning, both intended and unintended personality changes correlate strongly with shifts along the relevant persona vectors. This makes undesirable personality changes predictable, monitorable, and mitigable, including by automatically flagging training data.
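One way to picture the two quantities involved is sketched below: a finetuning shift measured as the projection of the mean activation change onto a persona vector, and a data flag raised when a training sample projects unusually far along that vector. The helper names, the threshold, and the toy arrays are hypothetical, and the paper's actual flagging metric may differ.

```python
import numpy as np

def finetuning_shift(base_acts, tuned_acts, vector):
    """Projection onto `vector` of the mean activation change after finetuning."""
    delta = tuned_acts.mean(axis=0) - base_acts.mean(axis=0)
    return float(delta @ vector)

def flag_samples(sample_acts, reference_acts, vector, threshold):
    """Indices of samples projecting more than `threshold` beyond the reference mean."""
    scores = (sample_acts - reference_acts.mean(axis=0)) @ vector
    return np.flatnonzero(scores > threshold)

# Toy data: a 4-dimensional activation space with the persona vector on axis 0.
vector = np.array([1.0, 0.0, 0.0, 0.0])
base_acts = np.zeros((3, 4))
tuned_acts = base_acts + 2.0 * vector  # finetuning moved activations along the vector

shift = finetuning_shift(base_acts, tuned_acts, vector)  # 2.0

# Three candidate training samples; only the second lies far along the vector.
samples = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
])
flagged = flag_samples(samples, base_acts, vector, threshold=2.0)  # array([1])
```

The threshold of 2.0 is arbitrary; in practice it would need calibrating against samples known to induce or not induce the trait.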
Source
- Link: https://arxiv.org/abs/2507.21509
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- character-training-and-persona-steering — Black-box safety (understand and control current model behaviour) / Model psychology
- activation-engineering — White-box safety (i.e. Interpretability) / Concept-based interpretability