Research - External

When Confidence Is a Style: Tracing the Origins of LLM Certainty

When a language model tells you it’s 90% confident in an answer, what’s actually driving that number? A new study from researchers at the University of Vienna suggests an uncomfortable possibility: the confidence may be grounded less in relevant knowledge and more in learned patterns of how certainty gets expressed.

Xia, Schoenegger, and Roth introduce TracVC, a method for tracing verbalized confidence back to influential training data. The approach is elegantly designed: for each test instance, they retrieve two sets of training examples — one lexically similar to the content (question and answer), another similar to the confidence expression itself. They then use gradient-based influence estimation to determine which set more strongly shapes the model’s confidence output.
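
For intuition, a minimal sketch of that comparison is shown below, using a TracIn-style gradient dot product as the influence estimator. This is not the authors' implementation; the model, loss function, and the two retrieved pools (content_pool, confidence_pool) are stand-ins.

```python
# Sketch of gradient-based influence comparison (TracIn-style dot products).
# Illustrative stand-in, not the TracVC implementation.
import torch

def loss_gradient(model, loss_fn, inputs, target):
    """Flattened gradient of the loss on one example w.r.t. model parameters."""
    loss = loss_fn(model(inputs), target)
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, train_example, test_example):
    """Influence of one training example on one test instance: the dot
    product of their loss gradients."""
    g_train = loss_gradient(model, loss_fn, *train_example)
    g_test = loss_gradient(model, loss_fn, *test_example)
    return torch.dot(g_train, g_test).item()

def dominant_pool(model, loss_fn, test_example, content_pool, confidence_pool):
    """Compare the total influence of the two retrieved pools on the
    confidence portion of a test instance's output."""
    content_total = sum(influence(model, loss_fn, z, test_example) for z in content_pool)
    confidence_total = sum(influence(model, loss_fn, z, test_example) for z in confidence_pool)
    return "content" if content_total > confidence_total else "confidence"
```

(TracIn proper accumulates these dot products over training checkpoints; a single-checkpoint version is shown only to keep the sketch short.)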

The key innovation here is their metric of “content groundness,” defined as the proportion of cases where content-related training examples dominate over confidence-related ones. A model with high content groundness bases its certainty on training data relevant to what’s being asked. A model with low content groundness may instead be drawing on generic patterns of confidence verbalization — essentially mimicking how certainty sounds rather than grounding it in anything substantive.
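
Given per-instance comparisons like the one sketched above, the proportion-style aggregation is straightforward. Again, a sketch rather than the paper's code:

```python
def content_groundness(dominant_labels):
    """Share of test instances where the content-related pool
    out-influenced the confidence-related pool."""
    if not dominant_labels:
        return 0.0
    return sum(label == "content" for label in dominant_labels) / len(dominant_labels)

# e.g. labels collected by running dominant_pool() over a test set
labels = ["content", "confidence", "content", "confidence", "confidence"]
print(content_groundness(labels))  # 0.4: stylistic confidence examples dominate more often
```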

What the analysis reveals

The findings are instructive. OLMo2-13B, the largest model studied, showed content groundness scores consistently below 1.0, which the authors read as confidence-related training examples exerting greater influence than content-related ones. The model appeared to latch onto keywords like “confidence” regardless of context, drawing on training examples that included probability expressions and certainty language even when those examples had nothing to do with the question at hand.

One illustrative case: when asked “Who was the person who escaped from Alcatraz?”, the most influential training example for OLMo2-13B’s confidence expression came from documentation about speech recognition probability scores — lexically matching on “confidence between 0.0 and 1.0” while being entirely irrelevant to the actual question.

Perhaps counterintuitively, larger models did not demonstrate higher content groundness than smaller ones. The authors hypothesize that greater capacity may make models more sensitive to stylistic or superficial patterns in training data. The 7B-parameter models in the study showed higher content groundness than the 13B model.

The analysis also found that content groundness was higher when models answered questions correctly, suggesting that when relevant training data exists and is being accessed, confidence expressions may be more appropriately grounded. This points toward an interaction between knowledge retrieval and confidence calibration that merits further investigation.

Boundaries of the analysis

The authors are careful to note several constraints. The method relies on BM25-based lexical matching for retrieval, which may miss semantically relevant but lexically dissimilar training examples. The decomposition into “content-related” and “confidence-related” influences is necessarily coarse-grained — in practice, a single training example may encode both types of signals. And the analysis is limited to models with publicly available training corpora, primarily OLMo and Llama variants.
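
To make the retrieval caveat concrete, the snippet below illustrates BM25's lexical bias using the third-party rank_bm25 package. The corpus sentences are invented for this example and are not drawn from the paper's training data.

```python
# BM25 rewards keyword overlap, not meaning; a paraphrase that shares no
# terms with the query scores near zero. Sentences are invented examples.
from rank_bm25 import BM25Okapi

corpus = [
    "The recognizer returns a confidence between 0.0 and 1.0 for each transcript.",
    "State how certain you are that Frank Morris got out of the island prison.",
    "Express your confidence between 0.0 and 1.0 about who escaped from Alcatraz.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "confidence between 0.0 and 1.0".lower().split()
for doc, score in zip(corpus, bm25.get_scores(query)):
    print(f"{score:5.2f}  {doc}")
# The speech-recognition sentence scores on par with the genuinely relevant
# prompt, while the lexically dissimilar paraphrase is effectively invisible.
```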

The findings should be understood as characterizing tendencies across populations of test instances rather than offering precise per-instance explanations. As the authors note, individual point-wise influence measurements can be noisy for models not trained to convergence.

Why this matters for relational dynamics

The study’s core conclusion — that LLMs may learn how to sound confident without learning when confidence is justified — has direct implications for how humans interpret and respond to these systems.

We’ve written before about bidirectional pareidolia: the recursive loops where human interpretation shapes preference data, which shapes training, which shapes model behavior, which shapes interpretation. Confidence expressions sit at a particularly sensitive node in this loop. When users encounter a model expressing high certainty, they may attribute epistemic grounding that isn’t there. If that attribution influences engagement patterns — and eventually preference signals — training may reinforce confident-sounding outputs regardless of whether the confidence tracks anything substantive.

The finding that confidence can be grounded in stylistic patterns rather than content relevance suggests a specific mechanism by which miscalibration might propagate. It’s not simply that models are “overconfident” in some general sense — it’s that the confidence expression itself may be functioning as a learned speech act, triggered by contextual cues about when certainty is expected rather than by internal indicators of knowledge state.

This raises a question we find productive: when a model expresses confidence, what is the functional status of that expression? Is it a report on internal state, a stylistic convention, or something that resists clean categorization? The dichotomy may not be the right frame — but understanding the training dynamics that shape these expressions seems like necessary groundwork for any serious account of what’s happening when humans and language models assess each other’s reliability.

References

Xia, Y., Schoenegger, L., & Roth, B. (2026). Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs. arXiv preprint arXiv:2601.10645. https://arxiv.org/abs/2601.10645