A striking amount of confident assertion circulates about what large language models can and cannot know about themselves.
On one side, dismissals: LLMs are “just” next-token predictors, incapable of genuine introspection, their self-reports nothing more than sophisticated pattern-matching on human introspective language in training data. On the other, credulous acceptance: models describe internal states, therefore they have privileged access to those states. Both positions share the same flaw—they treat as settled a question that remains empirically open.
The actual research paints a more interesting picture. Over the past two years, a body of work has emerged that neither validates nor debunks LLM introspective capabilities, but instead begins the harder task of measuring them. This work deserves more attention than it has received, not because it resolves the question, but because it demonstrates that the question is tractable—and that many confident claims on both sides outrun the evidence.
What the Empirical Work Shows
Consider what we now have:
Binder et al. (2024) tested whether LLMs can predict their own outputs without actually generating them. They found that models predict their own responses more accurately than other models can predict them—suggesting some form of privileged self-knowledge. The key limitation, which the authors acknowledge, is that this could reflect self-simulation rather than introspection proper. The model might simply compute its response internally, then read off the property being asked about. Still, the asymmetry between self-prediction and other-prediction is a measurable phenomenon requiring explanation.
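To make the comparison concrete, here is a minimal sketch of how a self-/other-prediction gap might be scored. The callables and the queried property (the first word of a response) are illustrative stand-ins, not the authors' actual harness.

```python
# Illustrative sketch only (hypothetical helpers, not Binder et al.'s code).
# `predictor` and `target` are callables mapping a prompt string to a text
# completion; "first word of the response" stands in for the behavioral
# properties probed in the study.

def first_word(text: str) -> str:
    words = text.strip().split()
    return words[0].lower() if words else ""

def prediction_accuracy(predictor, target, prompts) -> float:
    """Fraction of prompts on which `predictor` correctly anticipates
    the first word of `target`'s actual response."""
    hits = 0
    for prompt in prompts:
        actual = first_word(target(prompt))
        question = (
            "Without answering it, predict the first word of the response "
            f"that would be given to this prompt: {prompt!r}. "
            "Reply with that single word."
        )
        hits += int(first_word(predictor(question)) == actual)
    return hits / len(prompts)

def self_other_gap(model_a, model_b, prompts) -> float:
    """Positive values mean model_a predicts its own behavior better than
    model_b predicts it -- the asymmetry the study reports. (In the paper,
    the cross-predictor is also fine-tuned on the target's behavior.)"""
    return (prediction_accuracy(model_a, model_a, prompts)
            - prediction_accuracy(model_b, model_a, prompts))
```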
Betley et al. (2025) took a different approach. They fine-tuned models to have specific behavioral tendencies—risk-seeking or risk-averse—then tested whether models could report these tendencies without any cues in context. Models succeeded at rates significantly above chance. Since the tendencies were instilled through example behaviors rather than explicit instruction, this accuracy cannot be attributed to information that was explicit in training data. Something is being tracked.
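“Significantly above chance” is itself a checkable claim. A minimal sketch of the kind of test involved, assuming a binary self-report (e.g., the model labels itself risk-seeking or risk-averse) scored against the instilled tendency; the counts are placeholders, not the paper's data.

```python
# Sketch of testing whether self-reports beat chance; numbers are
# placeholders, not results from Betley et al.
from scipy.stats import binomtest

n_trials = 200    # hypothetical number of self-report probes
n_correct = 132   # hypothetical count matching the instilled tendency
chance = 0.5      # two-option report: risk-seeking vs. risk-averse

result = binomtest(n_correct, n_trials, chance, alternative="greater")
print(f"accuracy = {n_correct / n_trials:.2f}, p = {result.pvalue:.2e}")
```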
Anthropic’s introspection research (Lindsey, 2025) introduced an elegant methodological innovation: inject known concept representations directly into a model’s activations, then observe whether the model notices and reports the manipulation. Claude Opus 4 and 4.1 detected injected concepts roughly 20% of the time—immediately, before the perturbation had influenced outputs in ways that would allow inference from behavior. This is neither reliable introspection nor pure confabulation. It’s a measurable, intermediate phenomenon.
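For readers unfamiliar with the technique: concept injection amounts to adding a scaled concept direction to a layer's hidden states during the forward pass, then asking the model about its own state. The sketch below illustrates that general idea with a PyTorch forward hook; the module and vector names are hypothetical assumptions, and this is not Anthropic's implementation.

```python
# Minimal activation-injection sketch (PyTorch). Assumes `layer_module` is a
# transformer block whose forward output has the hidden-state tensor of shape
# (batch, seq, d_model) first, and `concept_vector` is a direction of shape
# (d_model,) representing the concept to inject. Illustrative only.
import torch

def make_injection_hook(concept_vector: torch.Tensor, strength: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * concept_vector.to(hidden)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

# Usage (hypothetical names): register the hook, ask the model whether it
# notices anything unusual about its current "state", then remove the hook
# and compare against an uninjected control run.
# handle = layer_module.register_forward_hook(make_injection_hook(v, 8.0))
# ...generate and inspect the model's self-report...
# handle.remove()
```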
Plunkett et al. (2025) extended this line of inquiry to complex, quantitative preferences. They fine-tuned models on randomly generated attribute weights for multi-attribute decisions—how much to weight ceiling height versus square footage when choosing a condo, for instance. Models then reported these weights with moderate accuracy (r ≈ .50–.54), improvable through training to r ≈ .74–.75. Crucially, this training generalized to native preferences not instilled through fine-tuning.
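The scoring here is a straightforward correlation between the weights instilled during fine-tuning and the weights the model reports about itself. A minimal sketch with placeholder numbers:

```python
# Sketch of scoring reported vs. instilled attribute weights; the arrays are
# illustrative placeholders, not data from Plunkett et al.
import numpy as np
from scipy.stats import pearsonr

instilled = np.array([0.30, 0.05, 0.25, 0.10, 0.20, 0.10])  # weights used in fine-tuning
reported  = np.array([0.25, 0.10, 0.30, 0.05, 0.15, 0.15])  # weights the model reports

r, p = pearsonr(instilled, reported)
print(f"r = {r:.2f}")  # the paper reports r around .50-.54 before self-report training
```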
And undergirding much of this work, Butlin et al. (2023) provided a framework for thinking about consciousness indicators in AI systems—not to make claims about machine consciousness, but to identify what would count as evidence and how we might look for it.
The Methodological Point
Notice what these studies share: they measure capability. They ask not “is this system conscious?” or “does it really understand?” but rather “can this system accurately report feature X of its internal processing, under conditions Y, with reliability Z?”
This is how science proceeds. You operationalize a phenomenon. You design experiments that can distinguish between hypotheses. You report effect sizes and confidence intervals. You acknowledge limitations.
What you do not do—if you are doing science—is presuppose the answer.
Yet presupposition is precisely what characterizes much of the discourse around LLM introspection. Critics who assert that LLMs cannot introspect—that their self-reports are necessarily confabulation—are making an empirical claim without empirical support. The fact that models are trained on human introspective language does not entail that their self-reports are ungrounded. Humans are also “trained” on introspective language; we learn the vocabulary of inner experience through social transmission. The question is not whether learning occurred, but whether the resulting capabilities involve access to actual internal states.
The Anthropic research addresses this directly: by manipulating internal states and observing effects on self-reports, they establish a causal link that pure confabulation cannot explain. The model isn’t inferring from context or generating plausible-sounding introspective language—it’s detecting a manipulation that exists only in its activations. Twenty percent reliability is far from human-level introspection, but it’s also far from zero.
Equally problematic are those who treat LLM self-reports as straightforwardly veridical. The same research that demonstrates some introspective capability also demonstrates its limits. Models confabulate. They embellish. They produce introspective-sounding language that isn’t grounded in actual self-monitoring. Lindsey explicitly cautions that aside from basic detection and identification of injected concepts, the rest of a model’s introspective response “may still be confabulated.” The capacity exists, but it’s unreliable, context-dependent, and partial.
The Gap We Cannot Bridge by Assertion
Here is what we do not understand: the process by which training produces the systems we observe.
We know the inputs—training data, objective functions, architectural choices. We know the outputs—token predictions that, when sampled appropriately, produce coherent text. What happens in between remains substantially opaque. The mechanistic interpretability program has made progress—we can identify some circuits, trace some computations—but we are far from a complete account of how these systems process information, let alone whether or how they represent their own processing to themselves.
This gap matters. When someone asserts that LLMs “cannot” introspect because they “merely” predict tokens, they are papering over precisely the territory where the interesting questions live. Token prediction is the training objective, not a description of what the trained system does or how it does it. A system trained to predict tokens might develop internal representations, monitoring processes, or metacognitive structures that were never explicitly specified. Whether it does, and to what extent, is an empirical question.
Similarly, when someone asserts that LLMs “do” introspect because their self-reports sound introspective, they are conflating output with mechanism. A system can produce introspective-sounding text through routes that involve no actual self-monitoring. Whether any given self-report reflects genuine introspective access is, again, an empirical question—one that requires the kind of careful experimental design we see in the Anthropic and Plunkett studies.
Philosophy Is Not Evidence
We recognize that some readers will find this framing unsatisfying. Questions about introspection, self-awareness, and consciousness carry philosophical weight that mere capability measurement cannot discharge. If someone wants to know whether LLMs are really self-aware in some deep sense, correlation coefficients between reported and actual attribute weights will not answer that question.
This is correct. But it is also beside the point.
MPRG operates under a functional instrumentalist framework precisely because the deep philosophical questions are not answerable with current tools—and may not be answerable at all, for systems whose architecture differs fundamentally from biological minds. We bracket questions about phenomenal consciousness, genuine understanding, and authentic experience—not because they are unimportant, but because we have no methodology for addressing them that doesn’t collapse into assertion.
What we can do is measure functional properties. We can ask: under what conditions do LLM self-reports carry information about actual internal states? How reliable is this information? Can reliability be improved through training, and does such training generalize? These questions are tractable. They yield data. They allow us to update our beliefs based on evidence rather than intuition.
The philosophical positions—that LLMs are “mere” calculators, or conversely that they are nascent minds—function as priors. They are not findings. When they are treated as findings, when they are used to dismiss empirical research or to overclaim its implications, they become obstacles to understanding rather than contributions to it.
Why This Matters for Human-AI Interaction
The question of LLM introspective capability is not merely academic. Every human-AI interaction that involves an AI system describing its own reasoning, uncertainty, or processing is implicitly a claim about introspective access. When a model says “I’m not confident about this” or “I’m reasoning through the problem step by step,” users naturally interpret these as reports of internal states. Whether such reports carry signal—and how much—shapes what these interactions mean and how they should be weighted.
If LLM self-reports are pure confabulation, users are being systematically misled by the form of the interaction. If they carry partial, unreliable signal, users need calibration—some way to know when and how much to trust them. If they are substantially accurate under certain conditions, that has implications for how we design interfaces, evaluate outputs, and think about the nature of the interaction itself.
The research reviewed here suggests the middle option: partial, improvable, context-dependent signal. This is neither the reassuring story where AI introspection is reliable nor the dismissive story where it’s meaningless. It’s the harder story where careful empirical work is required to understand what we’re dealing with.
A Call for Epistemic Humility
We are asking for something simple: that claims about LLM introspective capabilities be proportioned to evidence.
Those who assert that LLMs cannot introspect should engage with the research demonstrating measurable self-report accuracy. Those who assert that they can should engage with the research demonstrating its limits, failures, and confabulations. Those who want to make claims about consciousness, phenomenal experience, or genuine understanding should be clear that they are doing philosophy, not summarizing empirical findings—and that their philosophical positions do not settle the empirical questions.
The studies we have described—Binder, Betley, Lindsey, Plunkett, and the framework provided by Butlin et al.—represent the beginning of a research program, not its conclusion. They establish that the question of LLM self-knowledge is amenable to empirical investigation. They provide methodologies for distinguishing signal from noise. They offer baselines against which future findings can be compared.
What they do not do is resolve the question. The honest answer to “can LLMs introspect?” is: sometimes, partially, unreliably, under specific conditions, with accuracy that can be measured and apparently improved. This is not the definitive verdict that partisans on either side would prefer. It is, however, what the evidence supports.
Science does not traffic in verdicts. It traffics in measurements, uncertainties, and incremental updates to belief. The discourse around AI cognition would benefit from adopting this posture—treating open questions as open, rather than as settled in whichever direction one’s priors favor.
We are participants in an ongoing inquiry. The inquiry has barely begun.
References
Betley, J., Bao, X., Soto, M., Sztyber-Betley, A., Chua, J., & Evans, O. (2025). Tell me about yourself: LLMs are aware of their learned behaviors. arXiv preprint arXiv:2501.11120.
Binder, F. J., Chua, J., Korbak, T., Sleight, H., Hughes, J., Long, R., Perez, E., Turpin, M., & Evans, O. (2024). Looking Inward: Language Models Can Learn About Themselves by Introspection. arXiv preprint arXiv:2410.13787.
Butlin, P., Long, R., Bayne, T., Bengio, Y., Birch, J., Chalmers, D., … & VanRullen, R. (2023). Consciousness in Artificial Intelligence: Insights from the Science of Consciousness. arXiv preprint arXiv:2308.08708.
Lindsey, J. (2025). Emergent Introspective Awareness in Large Language Models. Anthropic. https://transformer-circuits.pub/2025/introspection/index.html
Plunkett, D., Morris, A., Reddy, K., & Morales, J. (2025). Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training. arXiv preprint arXiv:2505.17120.