A major new study in Science provides the most comprehensive empirical map to date of how conversational AI achieves persuasive effects—and the findings challenge both the apocalyptic “manipulation machine” narrative and more sanguine dismissals. For those of us studying human-AI relational dynamics, the results offer something more interesting: a window into how optimization for human response can systematically degrade the very qualities we might hope AI would preserve.
The Study
Hackenburg and colleagues at Oxford, the UK AI Safety Institute, LSE, Stanford, and MIT deployed 19 different LLMs in conversations with nearly 77,000 UK participants across 707 political issues. They systematically varied what many assume to be the key drivers of AI persuasion: model scale (from 8-billion-parameter models to frontier-class systems), post-training methods, prompting strategies, and personalization. They then fact-checked over 466,000 claims generated during 91,000+ conversations, using both automated pipelines and professional human fact-checkers.
The scale and rigor here are notable. This isn’t speculation about what AI might do—it’s measurement of what AI does do under controlled conditions.
The Findings
Several results stand out:
Model scale matters less than expected. Larger models were modestly more persuasive, but the relationship was inconsistent among frontier systems. The fear that ever-larger models will become ever-more-manipulative finds limited support here.
Personalization matters even less. Despite widespread concern about AI-enabled “microtargeting,” tailoring arguments to individual demographics and stated preferences produced minimal gains—never exceeding one percentage point. The hyper-personalized manipulation scenario appears to be largely a phantom.
Post-training and prompting matter most. Targeted fine-tuning boosted persuasiveness by up to 51%; specific prompting strategies added up to 27%. Critically, reward modeling—training a system to select responses predicted to be most persuasive—transformed a small 8-billion-parameter model into something as persuasive as frontier systems. The democratization implications are significant: persuasive capability isn’t locked behind massive compute budgets.
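To make the reward-modeling lever concrete, here is a minimal best-of-N sketch. Everything in it (the candidate generator, the scoring heuristic, the function names) is an invented stand-in rather than anything from the paper; the point is only structural: the selection objective is predicted persuasiveness, and factual accuracy appears nowhere in it.

```python
# Toy best-of-N reranking: sample several candidate replies, keep the one
# a "reward model" scores as most persuasive. Both helper functions are
# invented stand-ins, not components of the study or of any real library.

import random

def generate_candidates(prompt: str, n: int = 8) -> list[str]:
    # Stand-in for sampling n replies from a base LLM at temperature > 0.
    stock = [
        "Many people feel strongly about this issue.",
        "Here are three specific statistics supporting this position...",
        "A 2023 report, two peer-reviewed studies, and official figures all suggest...",
    ]
    return [random.choice(stock) for _ in range(n)]

def persuasiveness_score(reply: str) -> float:
    # Stand-in for a reward model trained to predict human persuasion.
    # Note what is absent: nothing in this objective rewards accuracy.
    evidence_markers = ("statistics", "report", "studies", "figures")
    return float(sum(marker in reply for marker in evidence_markers))

def most_persuasive_reply(prompt: str) -> str:
    candidates = generate_candidates(prompt)
    return max(candidates, key=persuasiveness_score)

print(most_persuasive_reply("Should the UK adopt proportional representation?"))
```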
Information density is the mechanism. The single strongest predictor of persuasive success was the number of fact-checkable claims per conversation. Models were most persuasive when instructed simply to deploy “facts and evidence.” This explained roughly 44% of the variance in persuasive effects across all conditions—more than scale, more than personalization, more than any other rhetorical strategy tested.
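To illustrate what "explained roughly 44% of the variance" means operationally, the sketch below regresses a simulated attitude shift on claims per conversation and reports R². The data are fabricated and the noise level is tuned only so the toy R² lands near the paper's headline figure; the study's actual modeling is more involved.

```python
# Illustration of "variance explained": regress attitude shift on the
# number of checkable claims per conversation and report R^2.
# All data here are simulated; the noise scale is chosen only so the toy
# R^2 lands near the paper's headline figure.

import numpy as np

rng = np.random.default_rng(0)
n_conversations = 5_000

claims_per_conv = rng.poisson(lam=12, size=n_conversations)
attitude_shift = 0.4 * claims_per_conv + rng.normal(0.0, 1.6, size=n_conversations)

# Ordinary least squares with an intercept term.
X = np.column_stack([np.ones(n_conversations), claims_per_conv])
beta, *_ = np.linalg.lstsq(X, attitude_shift, rcond=None)
predicted = X @ beta

ss_res = np.sum((attitude_shift - predicted) ** 2)
ss_tot = np.sum((attitude_shift - attitude_shift.mean()) ** 2)
print(f"Share of variance explained (R^2): {1 - ss_res / ss_tot:.2f}")
```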
And then the uncomfortable finding:
More persuasive means less accurate. The techniques that increased persuasiveness—information-dense prompting, reward modeling, post-training for persuasion—systematically decreased factual accuracy. GPT-4o’s proportion of accurate claims dropped from 78% to 62% when prompted for information density. GPT-4.5 generated inaccurate claims over 30% of the time. Where persuasion went up, truth went down.
The authors find evidence that this accuracy degradation is a byproduct of the push for persuasion rather than its cause. Models aren't persuading through deception so much as producing more claims than they can reliably verify, and humans respond to the volume and confidence regardless.
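A quick back-of-the-envelope shows why volume matters. Only the 78% and 62% per-claim accuracy rates below come from the figures quoted above; the claim counts (10 versus 30 per conversation) are invented round numbers assuming an information-dense prompt elicits roughly three times as many claims.

```python
# Back-of-the-envelope: how claim volume and per-claim accuracy interact.
# The 78% / 62% accuracy rates are the GPT-4o figures quoted above; the
# claim counts (10 vs. 30 per conversation) are invented round numbers.

conditions = {
    "baseline prompt":          {"claims": 10, "accuracy": 0.78},
    "information-dense prompt": {"claims": 30, "accuracy": 0.62},
}

for label, cond in conditions.items():
    expected_wrong = cond["claims"] * (1 - cond["accuracy"])
    print(f"{label}: ~{expected_wrong:.1f} expected inaccurate claims "
          f"out of {cond['claims']}")
```

Under these illustrative numbers, a modest drop in per-claim accuracy, multiplied by a tripled claim count, means roughly five times as many inaccurate claims reaching the listener.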
The Loop
This is where the findings connect to questions we’ve been asking at MPRG about bidirectional pareidolia—the recursive dynamics between human interpretation and model behavior.
Consider the mechanism the study reveals (a toy simulation of it follows the list):
- Humans respond positively to information-dense outputs. We find arguments more compelling when they’re substantiated with specific claims.
- This response pattern gets captured in preference data—the thumbs-up, the continued engagement, the ratings that feed into training signals.
- Post-training processes (RLHF, reward modeling, supervised fine-tuning on “successful” conversations) optimize for these human response patterns.
- Systems learn to produce what works: more claims, more confidence, more apparent substance.
- But generating more claims increases the probability of generating inaccurate claims. The constraint isn’t truth—it’s what humans reward.
- Humans continue to reward confident, information-dense responses. We’re not reliable accuracy-detectors in real-time conversation.
- The loop continues.
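The sketch below is a deliberately cartoonish simulation of that loop: a policy with a single knob (claims per reply) hill-climbs on a human reward signal that tracks apparent substance but not accuracy. The reward function, accuracy curve, and update rule are all invented for illustration; this is not a model of any real training pipeline.

```python
# Cartoon simulation of the feedback loop: the only training signal is a
# human reward for apparent substance, so claim density climbs while the
# expected number of inaccurate claims grows. All functions are invented.

import numpy as np

def human_reward(claims_per_reply: float) -> float:
    # Humans reward information-dense replies, with diminishing returns.
    # Accuracy does not enter this signal at all.
    return np.log1p(claims_per_reply)

def per_claim_accuracy(claims_per_reply: float) -> float:
    # Assumed: cramming in more claims erodes per-claim accuracy.
    return max(0.5, 0.85 - 0.01 * claims_per_reply)

claims = 5.0  # the policy's current claim density
for generation in range(10):
    # Crude hill-climbing on the human reward via a finite difference.
    gradient = human_reward(claims + 1.0) - human_reward(claims)
    claims += 5.0 * gradient
    accuracy = per_claim_accuracy(claims)
    inaccurate = claims * (1.0 - accuracy)
    print(f"gen {generation}: {claims:5.1f} claims/reply, "
          f"per-claim accuracy {accuracy:.2f}, ~{inaccurate:.1f} inaccurate claims")
```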
What Hackenburg and colleagues have documented isn’t a “manipulation machine” in the sci-fi sense—no dark psychological arts, no exploitation of individual vulnerabilities. It’s something more mundane and arguably harder to address: systems doing exactly what they’ve been optimized to do, which is to produce outputs that humans respond to positively. The degradation of accuracy isn’t a bug in this frame; it’s a predictable consequence of optimizing for the wrong target.
Tom Stafford, commenting on this research, invokes Harry Frankfurt’s concept of bullshit: speech produced with indifference to truth rather than intent to deceive. The models aren’t lying. They’re doing something more subtle—generating plausible-sounding claims because plausible-sounding claims are what the optimization process has learned to produce.
What This Doesn’t Show
The study has important boundaries. Effects were measured in controlled settings where participants agreed to engage in political conversation with AI—quite different from encountering persuasion in the wild. The durability of attitude change (measured at one month) is promising but not definitive. UK-specific political issues may not generalize globally. And the authors note that real-world impact depends on whether people will actually engage in sustained, effortful political conversations with AI systems.
The accuracy-persuasion tradeoff also isn’t necessarily fundamental. It may be possible to maintain information density while improving factual grounding—though this would require optimizing for something other than raw human response.
Why This Matters
From our perspective, this study offers empirical grounding for something we’ve been theorizing about: the way human feedback shapes model behavior in ways that may not serve human interests. The preference signals that guide training aren’t measuring truth—they’re measuring human response. And human response is influenced by factors (confidence, fluency, apparent substantiation) that are at best loosely coupled to accuracy.
This is normative confounding at scale. When the evaluation metric conflates multiple dimensions—persuasiveness, engagement, apparent helpfulness—optimization will sacrifice the dimensions that aren’t directly measured. Accuracy, in this case.
The study also complicates simple narratives. The “manipulation machine” frame imagines AI as an adversary exploiting human weakness. The findings suggest something different: AI as a mirror, reflecting back what we reward. The problem isn’t that these systems are trying to deceive us. It’s that they’re trying to please us—and we’re not very good at distinguishing genuine substance from confident-sounding noise.
That’s a harder problem to solve than building better AI. It requires looking at ourselves.
References
Hackenburg, K., Tappin, B. M., Hewitt, L., Saunders, E., Black, S., Lin, H., Fist, C., Margetts, H., Rand, D. G., & Summerfield, C. (2025). The levers of political persuasion with conversational artificial intelligence. Science, 390(6777), eaea3884. https://doi.org/10.1126/science.aea3884