Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey — 2025-07-29 — arXiv

Summary

Identifies directions in activation space (persona vectors) that capture personality traits such as sycophancy and hallucination in language models. Demonstrates two uses: monitoring fluctuations in model behavior at deployment time, and controlling personality shifts during training through post-hoc intervention and preventative steering.
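
To make the idea of a direction in activation space concrete, here is a minimal sketch in plain Python. It assumes the direction is taken as the difference between mean activations on trait-exhibiting and baseline responses, which is one common way to extract such directions and may differ in detail from the paper's pipeline. The function names (persona_vector, projection, steer) and the two-dimensional toy activations are hypothetical, chosen only for illustration.

```python
# Toy illustration only: the activations below are made-up numbers,
# not hidden states from any real model.

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """Difference of mean activations (assumed extraction method)."""
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_trait, mu_base)]

def projection(activation, direction):
    """Scalar projection of an activation onto the normalised direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

def steer(activation, direction, alpha):
    """Shift an activation by alpha * direction; negative alpha suppresses the trait."""
    return [a + alpha * d for a, d in zip(activation, direction)]

trait_acts = [[2.0, 0.0], [4.0, 0.0]]      # hypothetical trait-exhibiting responses
baseline_acts = [[0.0, 0.0], [0.0, 0.0]]   # hypothetical baseline responses

v = persona_vector(trait_acts, baseline_acts)
score = projection([1.5, 7.0], v)          # monitoring signal for one activation
steered = steer([1.5, 7.0], v, alpha=-0.5) # steering away from the trait
```

In this sketch, monitoring corresponds to reading off the projection score, and steering corresponds to adding a scaled copy of the direction to the activation.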

Key Result

Both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. This correlation enables prediction, monitoring, and mitigation of undesirable personality changes, including automated flagging of training data likely to cause them.
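
As a rough illustration of the result above, the sketch below measures a finetuning shift as the change in mean activation projected onto a unit-length persona direction, and flags training samples whose projection exceeds a threshold. The helper names, the threshold value, and the toy numbers are all assumptions made for this example; the paper's actual flagging criterion may be defined differently.

```python
# Toy illustration only: all vectors and the threshold are made-up values.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def finetuning_shift(base_mean_act, tuned_mean_act, unit_direction):
    """Change in mean activation, measured along a unit-length persona direction."""
    delta = [t - b for t, b in zip(tuned_mean_act, base_mean_act)]
    return dot(delta, unit_direction)

def flag_samples(sample_acts, unit_direction, threshold):
    """Indices of samples whose projection onto the direction exceeds the threshold."""
    return [i for i, act in enumerate(sample_acts)
            if dot(act, unit_direction) > threshold]

unit_v = [1.0, 0.0]  # hypothetical unit-length persona direction

shift = finetuning_shift([0.5, 1.0], [2.5, 1.0], unit_v)
flagged = flag_samples([[0.1, 3.0], [1.9, 0.2], [0.4, 0.0]], unit_v, threshold=1.0)
```

A large positive shift would suggest the finetuned model has moved toward the trait, and the flagged indices point at the samples most aligned with it.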

Source