A recent study from Bink and colleagues at the University of Regensburg and Neu-Ulm University of Applied Sciences examines what happens when you design an AI assistant to coach rather than answer—and users arrive expecting the opposite.
The research, presented at CHIIR ’26, introduces a conversational copilot grounded in digital literacy principles. Rather than providing direct responses to medical questions, the system employs Socratic questioning to scaffold information evaluation: prompting users to consider source credibility, apply lateral reading strategies, and reach their own conclusions. The design represents what the authors frame as “boosting”—enhancing user competencies rather than nudging behavior through choice architecture.
The Setup
In a pre-registered randomized controlled trial (N=261), participants were assigned to one of three conditions: a standard search interface, search augmented with an AI-generated overview, or search with both the overview and an interactive copilot. The AI overview was deliberately manipulated to present incorrect answers to medical questions, allowing researchers to detect overreliance. The copilot, powered by Gemini 2.5 Flash, introduced itself as a “Digital Literacy Copilot” designed to help users independently verify claims.
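The paper does not reproduce the copilot's prompt, so purely as an illustration of the design pattern it describes, here is a minimal sketch of a system instruction that steers a Gemini model toward Socratic coaching rather than direct answers. The prompt wording and the use of the google-genai Python SDK are our own assumptions, not the authors' implementation.

```python
from google import genai
from google.genai import types

# Illustrative paraphrase of the coaching behavior the paper describes;
# this is NOT the study's actual system prompt.
SYSTEM_PROMPT = """You are a Digital Literacy Copilot. Do not answer medical
questions directly. Instead, coach the user with Socratic questions: ask who
published a claim, whether independent sources confirm it (lateral reading),
and how credible the source appears. Help the user reach their own conclusion."""

client = genai.Client()  # expects an API key, e.g. via the GEMINI_API_KEY env var

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Does vitamin C prevent the common cold?",
    config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
)
print(response.text)  # ideally guiding questions, not a verdict
```

In a sketch like this, the coaching behavior lives entirely in the system instruction, which is exactly why user expectations matter so much: nothing stops a user from asking for the answer anyway.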
What They Found
Engagement with the copilot was remarkably high—96.74% of participants in that condition interacted with it, averaging over five conversational turns across roughly three minutes of dialogue. The qualitative analysis reveals users expressing uncertainty, seeking clarification, and demonstrating various information literacy behaviors: citing trusted sources like PubMed and the NHS, describing click restraint strategies, and articulating criteria for source reliability.
Yet the copilot did not significantly improve answer correctness compared to the baseline search condition. The authors identify a key mechanism: a “time-on-chat vs. exploration” trade-off. Participants who engaged extensively with the copilot spent less time exploring actual search results—and viewing search results was a significant predictor of correct answers. In 92% of cases, users interacted with the copilot before using the search component at all.
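To make the trade-off concrete, a toy analysis in the spirit of this finding (entirely synthetic data and assumed effect directions, not the authors' model or results) could regress answer correctness on time spent in the chat and the number of search results viewed:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; variable names, scales, and effect sizes are assumptions.
rng = np.random.default_rng(0)
n = 261
time_on_chat = rng.exponential(scale=3.0, size=n)        # minutes in copilot chat
results_viewed = rng.poisson(lam=4, size=n)              # search results opened
log_odds = -1.0 + 0.4 * results_viewed - 0.2 * time_on_chat
correct = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))   # simulated correctness

df = pd.DataFrame({
    "correct": correct,
    "time_on_chat": time_on_chat,
    "results_viewed": results_viewed,
})

# Logistic regression: does viewing results predict a correct answer,
# holding time spent in the chat constant?
fit = smf.logit("correct ~ time_on_chat + results_viewed", data=df).fit()
print(fit.summary())
```

In the study itself, it was the results-viewed side of this relationship that came out as a significant predictor of correctness; the sketch only shows the shape such an analysis might take.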
Perhaps more striking was the expectation mismatch. Despite the copilot’s coaching design, many users treated it as a direct answer engine. One participant’s frustrated message captures the tension: “this isnt helping, if i knew that i wouldnt need you.” Others explicitly asked the copilot to “just give me results” or compared it unfavorably to ChatGPT. The authors note that prior experience with conversational AI shaped these expectations—users arrived with mental models of AI as answer-provider, not learning partner.
The study also surfaced something the authors describe as a potential form of “intellectual humility”: participants in the copilot condition reported lower confidence in their answers than those in other conditions, even though their accuracy was comparable to baseline. The scaffolding may have heightened awareness of task complexity and knowledge gaps without translating into improved performance on the immediate task.
Boundaries of the Work
Several limitations deserve attention. The study used three pre-defined medical topics rather than participants’ own information needs, which may have reduced intrinsic motivation. The deliberately incorrect AI overview, while useful for detecting overreliance, doesn’t reflect typical search experiences. As a single-session design, the study cannot assess whether the digital literacy strategies would transfer to future search behavior or produce longer-term learning effects. The authors also acknowledge that Prolific participants may be more technologically literate than the general population.
Why This Matters to Us
This study speaks directly to questions we find ourselves circling at MPRG. The expectation mismatch the authors document—users approaching a coaching system with an answer-seeking orientation—illustrates how relational patterns developed with one class of AI system transfer to encounters with others. When users have internalized a model of conversational AI as fluent answer-provider, that framing shapes their engagement even with systems explicitly designed otherwise.
The finding that the copilot prompted metacognitive reflection without improving task performance is particularly interesting from a functional standpoint. The system appears to have induced something—awareness of complexity, recognition of knowledge gaps, perhaps a form of productive uncertainty—that registered in confidence ratings but didn’t translate to behavioral outcomes within the study’s timeframe. Whether this represents a genuine pedagogical effect that might manifest in different contexts, or simply friction without benefit, remains an open question.
We’re also struck by the authors’ framing of “friction” as potentially beneficial. The suggestion that minor effort or reflection prompts might encourage critical thinking over passive consumption aligns with broader questions about what we want from AI systems designed to support rather than replace human cognition. The tension between efficiency demands and learning goals may be inherent to this design space.
References
Bink, M., Risius, M., Kruschwitz, U., & Elsweiler, D. (2026). “Can You Tell Me?”: Designing Copilots to Support Human Judgement in Online Information Seeking. In Proceedings of the 2026 ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR ’26). ACM. https://doi.org/10.1145/3786304.3787866