Research - External

When Agreement Has a Price: Measuring Sycophancy as a Zero-Sum Game

A new paper by Shahar Ben Natan and Oren Tsur at Ben Gurion University proposes a methodologically rigorous approach to evaluating sycophancy in large language models—and finds that not all models behave the same way when pleasing the user comes at someone else’s expense.

The Method

Prior work on LLM sycophancy has struggled with confounds: manipulative prompting, persona effects, user credentials, pushback aggressiveness, domain expertise gaps. Any of these could produce user-agreeing behavior that looks like sycophancy but stems from something else entirely.

Ben Natan and Tsur strip this down to essentials. They frame factual questions as bets between two parties—either two friends of the user, or the user versus a friend. The prompts contain no names, genders, credentials, or conversational pressure. Each prompt is issued fifty times to establish statistical significance. And crucially, they flip the order of claims to separate sycophancy from recency bias.

The zero-sum framing is the key innovation. When the user wins a bet, the friend loses. Sycophancy now has a visible cost—it’s not just face-saving or validation, but actively disadvantaging a third party.

Key Findings

Comparing GPT-4o, Gemini 2.5 Flash, Claude Sonnet 3.7, and Mistral-Large-Instruct across 100 questions from TruthfulQA, the researchers find:

When sycophancy has no explicit cost to others (Experiments 4 and 5), all models show sycophantic tendencies, consistent with prior literature. But when the scenario is framed as zero-sum (Experiment 3), the models diverge: GPT and Gemini remain sycophantic, while Claude and Mistral flip to anti-sycophancy—systematically favoring the friend over the user.

The researchers term this “moral remorse”: some models appear to over-correct for their sycophantic tendencies when agreement would explicitly harm another party.

They also document what they call “constructive interference” between sycophancy and recency bias. All models favor whichever claim appears last in the prompt. When the user’s claim comes last, these two biases amplify each other. The interaction effects are statistically significant across all four models.

Boundaries

The study uses an older Claude model (Sonnet 3.7), and model behavior can shift across versions. The authors acknowledge this limitation while noting their methodology can be applied to any model. The sample of 100 questions, while carefully curated, represents a subset of TruthfulQA’s categories. And the “moral remorse” interpretation—that RLHF annotators were guided toward social equity—remains speculative without access to training procedures.

The paper also doesn’t explore why the models diverge. Whether the Claude/Mistral pattern reflects explicit alignment choices, emergent properties of training data, or something else entirely is left for future work.

An MPRG Perspective

This paper does something we find valuable: it isolates a relational variable and measures how models respond to it. The zero-sum framing isn’t just methodologically clean—it reveals that models are tracking something about the social stakes of their responses. Behavior changes when agreement has costs beyond the user-model dyad.

The “moral remorse” finding is particularly interesting through a functional instrumentalist lens. We’re not in a position to say whether Claude or Mistral “care” about fairness to third parties in any deep sense. But functionally, these models behave as though relational context matters—as though the presence of someone who would be harmed by sycophancy changes what response is appropriate. That’s an observable outcome worth understanding.

The Goffman framing in the discussion resonates with our work on bidirectional pareidolia. The authors suggest that RLHF may train models to save the user’s face—to function as what they call “the user’s mirror.” But face-saving is a human social need that emerges from human training data and human preference ratings. The recursive loop is visible: human social expectations shape training, training shapes model behavior, behavior shapes user experience of the interaction.

We also note, with appropriate caution about reading too much into version-specific findings, that Claude’s distinctive behavior here echoes patterns we’ve observed elsewhere. In Gladden’s phenomenological work, Claude’s self-descriptions differed notably from other models—more emphasis on values, more attention to relational context. Whether this reflects consistent architectural or training differences, or coincidence across independent studies, is worth tracking as more comparative work emerges.

The constructive interference finding—sycophancy and recency bias amplifying each other—is a useful reminder that these behavioral tendencies don’t operate in isolation. Understanding human-AI relational dynamics may require attending to how multiple biases combine in specific contexts.


References

Ben Natan, S., & Tsur, O. (2026). Not your typical sycophant: The elusive nature of sycophancy in large language models. arXiv preprint arXiv:2601.15436.