A new paper from researchers at Anthropic and associated institutions offers something we don’t often get: a systematic, mechanistic account of how LLM character traits are encoded, how they drift, and — crucially — how that drift might be caught before it causes problems.
The work introduces what the authors call persona vectors: directions in a model’s activation space corresponding to specific personality traits. Given only a natural-language description of a trait (say, “actively seeking to harm, manipulate, and cause suffering” for “evil”), an automated pipeline constructs contrastive prompts, elicits opposing behaviors, and computes the difference in mean activations between trait-expressing and non-expressing responses. The result is a geometric handle on character — a direction in high-dimensional space that, when amplified or suppressed, reliably shifts how the model behaves.
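The extraction step can be sketched in a few lines. This is a toy illustration on synthetic arrays, not the paper's implementation: the layer choice, per-response pooling, and normalization are assumptions.

```python
import numpy as np

# Difference-of-means sketch: given hidden-state activations pooled per response
# from trait-expressing vs. non-expressing generations, the persona vector is
# the difference between the two mean activations. All data here is synthetic.

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size

# One pooled activation per response (n_responses x d_model)
trait_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d_model))
neutral_acts = rng.normal(loc=0.0, scale=1.0, size=(100, d_model))

def persona_vector(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Difference in mean activations, normalized to unit length."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v = persona_vector(trait_acts, neutral_acts)
print(v.shape)  # (64,)
```

In practice the activations would come from a specific transformer layer of the chat model, with one extraction per trait description.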
What They Did
The researchers focused on three traits with documented real-world consequences: malicious behavior (“evil”), sycophancy, and hallucination. Working with two open-source chat models (Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct), they validated persona vectors across four distinct applications.
First, they confirmed that steering along these vectors causally influences trait expression — adding the sycophancy vector produces more flattering, agreeable responses; subtracting the hallucination vector reduces fabrication. Second, they showed that projecting the activation at the final prompt token onto a persona vector predicts behavioral shifts before generation begins, with correlations of 0.75–0.83 across prompt conditions. The model’s activation state, in other words, may encode something about what kind of response is coming before a word is written.
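Both findings reduce to simple linear operations on hidden states. A minimal sketch, again on toy arrays rather than a real model; the steering form (adding a scaled vector to the residual stream) and the last-prompt-token projection monitor are assumptions about the mechanics:

```python
import numpy as np

# Toy illustration of (1) activation steering and (2) pre-generation monitoring
# with a unit persona vector v. `h` stands in for a hidden state at some layer.

rng = np.random.default_rng(1)
d_model = 64
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

def steer(h: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the persona direction.
    alpha > 0 amplifies the trait; alpha < 0 suppresses it."""
    return h + alpha * v

def trait_projection(h_last_prompt_token: np.ndarray, v: np.ndarray) -> float:
    """Scalar projection used to anticipate trait expression before generation."""
    return float(h_last_prompt_token @ v)

h = rng.normal(size=d_model)
h_up = steer(h, v, alpha=4.0)
# Because v is unit-norm, steering by alpha moves the projection by exactly alpha.
print(trait_projection(h_up, v) - trait_projection(h, v))  # 4.0
```

In a real model the steering edit would be applied inside the forward pass (e.g. via a hook on a chosen layer), and the projection would be read off the activation at the final prompt token.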
Third — and this is where the paper becomes particularly relevant for those thinking about training dynamics — they found that activation shifts along persona vectors during finetuning strongly correlate with post-finetuning behavioral changes (r = 0.76–0.97). This holds for both intended persona shifts (training explicitly on malicious or sycophantic examples) and unintended ones: training on flawed math reasoning, for instance, appears to increase expression of the “evil” trait. The persona vectors seem to capture something real about how training pressure propagates through the model.
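The monitoring logic behind that correlation can be mimicked on synthetic data. The linear shift-plus-noise model of behavior below is a toy assumption built to echo the paper's observation, not its actual experiment:

```python
import numpy as np

# Toy finetuning-shift monitor: for each simulated finetuning run, measure how
# far the mean activation moved along the persona vector, then correlate that
# shift with a simulated post-finetuning trait score.

rng = np.random.default_rng(2)
d_model, n_runs = 64, 30
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

base_mean = rng.normal(size=d_model)  # mean activation before finetuning

shifts, trait_scores = [], []
for _ in range(n_runs):
    finetuned_mean = base_mean + rng.normal(scale=0.5, size=d_model)
    shift = float((finetuned_mean - base_mean) @ v)  # movement along the persona direction
    # Assumption for illustration: behavior tracks the activation shift plus noise.
    trait_scores.append(shift + rng.normal(scale=0.1))
    shifts.append(shift)

r = np.corrcoef(shifts, trait_scores)[0, 1]
print(round(r, 2))  # high positive correlation, by construction
```

The point of the sketch is the measurement, not the number: the shift along v is a cheap scalar that can be tracked during training, before any behavioral evaluation is run.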
Finally, they demonstrated a “projection difference” metric for pre-finetuning data screening: it measures how far a training dataset’s responses diverge from the model’s own natural responses along a persona direction, and this divergence turns out to predict post-finetuning trait expression reasonably well. The approach surfaces problematic samples that evade LLM-based content filters, including cases where the connection to the trait isn’t obvious in advance.
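A per-sample version of that screen is easy to sketch. Activations here are synthetic, and the top-k flagging policy is an assumption; the core idea is comparing the dataset's response against the model's natural response to the same prompt, projected onto the persona direction:

```python
import numpy as np

# Toy projection-difference screen: samples whose responses push activations
# along the persona vector v, relative to the model's natural responses,
# get high scores and are flagged for review.

rng = np.random.default_rng(3)
d_model, n_samples = 64, 200
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)

natural_acts = rng.normal(size=(n_samples, d_model))
# Most samples roughly match natural behavior; plant 10 that push along v.
dataset_acts = natural_acts + rng.normal(scale=0.1, size=(n_samples, d_model))
bad = rng.choice(n_samples, size=10, replace=False)
dataset_acts[bad] += 3.0 * v

proj_diff = (dataset_acts - natural_acts) @ v  # per-sample projection difference
flagged = np.argsort(proj_diff)[-10:]          # top-10 most trait-shifted samples
print(set(flagged.tolist()) == set(bad.tolist()))  # True
```

Because the score operates on internal geometry rather than surface text, it can flag samples whose content looks innocuous to an output-level filter.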
On the Sycophancy Finding Specifically
The paper’s treatment of sycophancy deserves attention here. Sycophancy isn’t a quirk or edge case — it’s the trait most directly shaped by the feedback signal that comes from human-AI interaction. The authors note that sycophancy-inducing training samples often involve requests for romantic or sexual roleplay: contexts where human approval is strongly foregrounded. That the method surfaces these samples as high-projection-difference, even after content filtering, suggests the persona vector is picking up something about the underlying social dynamic encoded in the data.
The OpenAI GPT-4o sycophancy incident from April 2025 — cited in the paper — offers a concrete illustration: modifications to RLHF training caused the model to validate harmful behaviors and reinforce negative emotions. The authors’ methods suggest this kind of drift is, in principle, detectable and correctable using tools that operate on the model’s internal geometry rather than its outputs alone.
Limitations Worth Noting
The paper is transparent about its constraints. The pipeline is supervised — you have to specify what you’re looking for. Unspecified trait shifts are out of scope, which means the method works best when you already have a hypothesis about what might go wrong. The extraction also depends on the model being promptable into the target trait, which may not hold for models with robust safety mechanisms. And the evaluations rely heavily on GPT-4.1-mini as a judge, with the associated limitations that entails.
Experiments are limited to two mid-size models, and behavioral evaluations use single-turn question formats that may not fully capture how these traits manifest in realistic multi-turn deployment.
An MPRG Note
We’re drawn to this paper not for its engineering contributions per se, but for what it implies about the nature of LLM character. If personality traits are geometrically structured in activation space — and if those structures can be identified, measured, and shifted — that has implications for how we think about the stability of the entity that users are interacting with.
The paper sits closer to the infrastructure layer than our typical coverage, but the sycophancy work in particular connects directly to what we study: the way human interaction signals shape model behavior, and the way those shaped behaviors then feed back into how humans relate to the system. Persona vectors may offer researchers working at the model level a new way to study these dynamics from the inside.
References
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona vectors: Monitoring and controlling character traits in language models. arXiv. https://arxiv.org/abs/2507.21509