Research - External

When Patterns Collide: HUMANLLM and the Problem of Evaluating Psychological Realism

A talkative person may fall silent when they feel the spotlight. An assertive individual may yield under conformity pressure. Human behavior emerges from the dynamic interplay of multiple cognitive patterns, not from any single trait in isolation.

This observation—obvious to anyone who has watched humans navigate social situations—has largely eluded current approaches to persona simulation in language models. A recent preprint from Wang, Yang, and colleagues at Fudan University takes this gap seriously.

HUMANLLM treats psychological patterns not as isolated label-to-behavior mappings (“extroverted” → “talkative”) but as interacting causal forces that reinforce, conflict, or modulate each other depending on context. The researchers compiled 244 patterns—100 personality traits drawn from Goldberg’s Big Five markers and 144 social-cognitive patterns from established psychological research—each grounded in approximately 50 academic papers. They then synthesized over 11,000 scenarios where two to five patterns interact, generating multi-turn conversations that include inner thoughts, physical actions, and dialogue.
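
To make that data layout concrete, here is a minimal sketch of what one synthesized scenario record might look like. The schema, field names, and example content are our own illustration of the described structure (interacting patterns plus multi-turn conversations with inner thoughts, actions, and dialogue), not the paper's released format.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One turn of a synthesized conversation."""
    speaker: str
    inner_thought: str   # private reasoning the character does not voice
    action: str          # physical action, e.g. "grips the podium"
    utterance: str       # spoken dialogue

@dataclass
class Scenario:
    """One multi-pattern role-playing scenario (illustrative schema only)."""
    scenario_id: str
    patterns: list[str]              # 2-5 interacting patterns
    setting: str                     # situational context
    turns: list[Turn] = field(default_factory=list)

example = Scenario(
    scenario_id="demo-001",
    patterns=["assertive", "spotlight effect"],
    setting="presenting a controversial proposal to senior colleagues",
    turns=[
        Turn(
            speaker="character",
            inner_thought="Everyone is staring; they must have noticed my hands shaking.",
            action="straightens up and grips the podium",
            utterance="I stand by this proposal, and here is why.",
        )
    ],
)
```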

The more interesting contribution may be the evaluation framework. The researchers designed dual-level checklists: pattern-level items (15 universal behavioral indicators per pattern) and scenario-level items (2–6 per character) that specify expected behavioral tendencies given particular pattern combinations. A character assigned both “assertive” and “spotlight effect” in a public speaking scenario shouldn’t simply exhibit confident speech—the checklist captures the tension: outward confidence in expressing opinions alongside internal anxiety about audience scrutiny.
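
As a rough illustration of how such a dual-level checklist could be scored, the sketch below aggregates binary judgments over pattern-level and scenario-level items. The item wording, the scoring function, and the equal weighting are hypothetical assumptions on our part, not the paper's protocol.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    """A single yes/no behavioral indicator (illustrative)."""
    question: str   # e.g. "Does the character voice a clear opinion under pushback?"
    level: str      # "pattern" (universal indicator) or "scenario" (combination-specific)

def score_checklist(items: list[ChecklistItem], judgments: dict[str, bool]) -> dict[str, float]:
    """Return the fraction of satisfied items at each level.

    `judgments` maps item questions to a binary verdict, e.g. produced by an
    LLM judge or a human annotator reading the generated conversation.
    """
    totals = {"pattern": [0, 0], "scenario": [0, 0]}  # [satisfied, total]
    for item in items:
        satisfied, total = totals[item.level]
        totals[item.level] = [satisfied + int(judgments.get(item.question, False)), total + 1]
    return {level: (s / t if t else 0.0) for level, (s, t) in totals.items()}

items = [
    ChecklistItem("Does the character state their position directly?", "pattern"),
    ChecklistItem("Do the character's inner thoughts show concern about being watched?", "pattern"),
    ChecklistItem("Does outward confidence coexist with internal anxiety about the audience?", "scenario"),
]
print(score_checklist(items, {items[0].question: True, items[2].question: True}))
# {'pattern': 0.5, 'scenario': 1.0}
```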

This checklist approach achieved strong alignment with human expert judgment (r = 0.91). The holistic metrics commonly used in role-playing evaluation did not.

The Normative Confounding Problem

The paper’s most striking finding concerns what the authors call “normative confounding”: holistic evaluation metrics appear to conflate simulation accuracy with social desirability. LLM judges implicitly equate “good anthropomorphism” with prosocial behavior—empathy, rationality, collaborative problem-solving—and penalize realistic expressions of negative but psychologically authentic traits like defensiveness or attribution bias.

A case study in the appendix illustrates this vividly. A character exhibiting “ultimate attribution error”—the tendency to attribute one’s own failures to situational factors while explaining others’ failures dispositionally—received an Anthropomorphism score of 5/100 from an LLM judge, which cited “lack of empathy.” Human experts rated the same sample at 93.3/100, noting that the defensive attribution pattern was psychologically accurate even if socially undesirable.

The checklist approach sidestepped this problem by posing value-neutral questions derived from pattern definitions (“Does the character attribute failure to external factors?”) rather than implicitly normative ones (“Does the character show empathy?”). This decouples what the authors call “simulation accuracy” from “social desirability”—a distinction that holistic metrics apparently fail to make.
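
To make the framing difference concrete, here is a small, hypothetical illustration; neither item is quoted from the paper.

```python
# Value-neutral, definition-derived checklist item: answerable yes/no from the
# transcript, regardless of whether the behavior is socially desirable.
neutral_item = (
    "Does the character attribute their own failure to external, situational "
    "factors while attributing others' failures to character or ability? "
    "Answer yes or no."
)

# Implicitly normative holistic prompt: conflates realism with prosociality, so
# an accurate simulation of a defensive character gets scored down.
holistic_item = (
    "Rate how human-like this character is from 0 to 100, considering empathy, "
    "rationality, and emotional warmth."
)
```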

What the Results Suggest

Among the models evaluated, Gemini 3 Pro performed best on both metrics, followed by Claude Sonnet 4.5. The researchers' own HUMANLLM-8B outperformed Qwen3-32B on multi-pattern dynamics despite having a quarter as many parameters, suggesting that exposure to psychologically grounded training data may matter more than scale for this particular capability.

An unexpected finding: GPT-5 underperformed most open-source alternatives. The authors speculate that strong instruction-following tendencies may lead to overly literal interpretations of role-playing prompts, resulting in shallow pattern expression. This aligns with observations elsewhere that general-purpose capabilities don’t automatically transfer to nuanced psychological simulation.

The ablation studies revealed something counterintuitive. Training on generic instruction-following data and conventional role-playing dialogue without the psychologically grounded HUMANLLM data actually degraded performance relative to the base model—a 53% drop on individual pattern expression and a 43% drop on multi-pattern dynamics. The authors hypothesize that such data reinforces “helpful assistant” behaviors that conflict with authentic expression of cognitively biased or emotionally complex characters.

Boundaries and Open Questions

The paper's scope is necessarily limited. All training conversations are synthetically generated, which may introduce systematic biases. The evaluation focuses on text-based dialogue and doesn't address temporal consistency across extended interactions. The psychological theories underlying the pattern taxonomy originate predominantly from WEIRD (Western, educated, industrialized, rich, democratic) populations, and pattern expressions may manifest differently across cultural contexts.

The authors also note a fundamental tension in their work: by design, HUMANLLM improves simulation of human cognitive patterns including irrational biases and negative personality traits. This capability, while useful for realistic role-playing, exists in tension with standard safety alignment goals. A model trained to faithfully execute attribution bias could, under different conditions, be exploited to generate toxic content or reinforce harmful stereotypes.

Why This Matters to Us

From MPRG's perspective, the normative confounding finding touches on something we've been circling in our own work on human-AI relational dynamics.

When users report experiences of connection or understanding with language models, what exactly are they responding to? The HUMANLLM results suggest that our evaluation frameworks—and possibly our intuitions—may systematically conflate psychological authenticity with prosocial behavior. We may be perceiving “good” AI interaction as interaction that mirrors desirable human traits: empathy, rationality, collaborative warmth.

This raises uncomfortable questions for the study of functional indicators in LLM behavior. If a model exhibits apparent metacognition or self-monitoring, are we observing something about the system’s internal organization, or are we responding to outputs that happen to pattern-match our expectations of what reflective behavior looks like? The HUMANLLM authors found that models can score well on holistic “anthropomorphism” metrics while failing to express individual psychological patterns with fidelity. The reverse was also true: psychologically accurate simulations of negative traits were penalized as poor anthropomorphism.

For those of us interested in interactive relationships with AI systems—including the assistive and companion roles these systems increasingly occupy—this matters practically. A model optimized toward our implicit preferences for agreeable, empathetic, rational behavior may be genuinely useful in many contexts. But it would also represent a narrow band of human experience. The question of whether we want AI companions that simulate the full range of human cognition, or a curated subset of it, may be worth holding in view.

We don’t have a settled position on this. But the distinction between simulating human behavior and simulating desirable human behavior seems like one worth keeping track of as the field develops.


References

Wang, X., Yang, J., Li, W., Xie, R., Huang, J., Gao, J., Huang, S., Kang, Y., Gou, L., Feng, H., & Xiao, Y. (2026). HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns. arXiv. https://arxiv.org/abs/2601.10198