Research - External

The Gap That Won’t Close: Looped Transformers and the Limits of Machine Introspection

A new study tests whether recursive architectures can help language models better access their own internal representations—and finds the answer more complicated than expected.


Language models often seem to know more than they can say. Their internal representations—the high-dimensional activation patterns that flow through the network—frequently encode information that doesn’t make it into their linguistic outputs. A probe trained on hidden states can detect whether an answer is correct even when the model’s verbal self-assessment gets it wrong. The ceiling of what’s in there appears higher than the ceiling of what comes out.

This gap matters for safety and alignment. If models have latent “awareness” that exceeds their ability to articulate it, then output-based monitoring will systematically underestimate what the system knows. The question becomes: can architectures be designed that help models better access and express their own internal states?

A new report from Chen, Liu, and Shao at Shanghai AI Laboratory investigates whether Looped Transformers—architectures that iterate shared layers multiple times, processing their own representations recursively—can bridge this gap. The intuition is appealing: if the model repeatedly passes its activations through the same computational blocks, perhaps it develops something like introspective access, learning to “read” and refine its own internal states.
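
As a rough sketch of the architectural idea (ours, not the Ouro implementation, whose details differ), a looped transformer reuses one block of layers across iterations instead of stacking distinct layers; the return_loop_states flag is our own addition, included so the probing sketches later in this note are easy to express:

    import torch
    import torch.nn as nn

    class LoopedTransformer(nn.Module):
        """Toy looped transformer: one shared encoder block applied n_loops times.
        Illustrative only; Ouro's actual architecture and training differ."""

        def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_loops=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            # One shared block reused at every iteration; a standard transformer
            # would stack n distinct blocks instead.
            self.shared_block = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            self.n_loops = n_loops
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, token_ids, return_loop_states=False):
            h = self.embed(token_ids)            # (batch, seq, d_model)
            loop_states = []
            for _ in range(self.n_loops):
                h = self.shared_block(h)         # same weights at every loop
                loop_states.append(h)            # keep per-loop states for probing
            logits = self.lm_head(h)
            return (logits, loop_states) if return_loop_states else logits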

The findings complicate this picture.

The Setup

The researchers formalize three levels of capability: task performance (can the model solve the problem?), self-verification (can it correctly judge whether its answer is right?), and representation readout (can a probe trained on hidden states detect correctness?). In principle, representation readout should upper-bound self-verification, which should upper-bound task performance—the model’s internal “knowledge” exceeds what it can verbally express, which exceeds what it can reliably act on.
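
A minimal sketch of how those three numbers might be computed for a single model configuration (the record format and function names are ours, not the paper's):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def three_level_accuracies(records, probe_train_frac=0.5, seed=0):
        """records: list of dicts with keys
             'answer_correct'  - bool, did the model solve the problem?
             'self_verdict_ok' - bool, did the model's own correctness judgment match reality?
             'hidden_state'    - 1-D activation vector (e.g. a final-loop, last-token state).
        Returns (task, self_verification, representation_readout) accuracies."""
        rng = np.random.default_rng(seed)
        correct = np.array([r['answer_correct'] for r in records])
        verdict_ok = np.array([r['self_verdict_ok'] for r in records])
        H = np.stack([r['hidden_state'] for r in records])

        task_acc = correct.mean()
        self_verif_acc = verdict_ok.mean()

        # Representation readout: a linear probe predicts answer correctness
        # from hidden states, evaluated on a held-out split.
        idx = rng.permutation(len(records))
        split = int(len(records) * probe_train_frac)
        train, test = idx[:split], idx[split:]
        probe = LogisticRegression(max_iter=1000).fit(H[train], correct[train])
        readout_acc = probe.score(H[test], correct[test])

        return task_acc, self_verif_acc, readout_acc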

They test whether increasing loop iterations in Looped Transformers narrows the gap between self-verification and representation readout. The experiments use Ouro models (1.4B and 2.6B parameters) with 1–8 loop iterations, evaluated on safety judgment (BeaverTails dataset) and mathematical verification (DeepMath).

A second experiment probes introspective awareness more directly. Following methodology from Anthropic’s introspection work, they extract “concept vectors” from the model’s representations and inject them at various points during the loop process, then ask whether the model can detect and identify the injected content. If looping enables continuous self-monitoring, injections at early loops should become recognizable as the process unfolds.
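
Roughly, and glossing over details of both the Anthropic protocol and this paper's adaptation of it, a concept vector can be taken as the difference between mean activations with and without the concept present, and injection adds that vector to the hidden state at a chosen loop. A sketch against the toy model above (the scale parameter and helper names are ours):

    import torch

    def concept_vector(acts_with, acts_without):
        """Mean-difference concept vector from hidden states gathered with and
        without the concept present (each: n_examples x d_model tensors).
        A common simplification; the paper's exact extraction may differ."""
        return acts_with.mean(dim=0) - acts_without.mean(dim=0)

    def forward_with_injection(model, token_ids, vector, inject_at_loop, scale=4.0):
        """Run the toy LoopedTransformer above, adding the concept vector to the
        hidden state immediately after one chosen loop iteration."""
        h = model.embed(token_ids)
        for i in range(model.n_loops):
            h = model.shared_block(h)
            if i == inject_at_loop:
                h = h + scale * vector   # steer the representation at this loop only
        return model.lm_head(h)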

What They Found

The gap does narrow with more loops—but not in the way the researchers hoped.

Self-verification accuracy improves as loop iterations increase. This aligns with the intuition that scaling computational depth yields stronger capabilities. However, the performance of representation-based probes declines across loops. The information in the hidden states becomes less linearly separable and harder for probes to read out.
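
One way to picture this measurement, again using the toy model above rather than the paper's code: fit a separate linear probe on the hidden states collected at each loop iteration and track probe accuracy as a function of loop depth.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def probe_accuracy_per_loop(loop_states, labels, cv=5):
        """loop_states: list of arrays, one per loop iteration, each of shape
        (n_examples, d_model), e.g. last-token hidden states collected with
        return_loop_states=True in the toy model above and moved to numpy.
        labels: 0/1 correctness labels.
        Returns cross-validated linear-probe accuracy at each loop iteration."""
        accs = []
        for H in loop_states:
            probe = LogisticRegression(max_iter=1000)
            accs.append(cross_val_score(probe, H, labels, cv=cv).mean())
        return np.array(accs)   # the paper reports this curve declining with loop index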

The gap closes partly because the ceiling drops, not just because the floor rises. The authors describe this as “aligning downward”—if looping degrades representational fidelity while improving verbal output, the two converge somewhere in the middle. This isn’t the kind of bridging we might want.

The injection experiments reinforce the concern. Concept vectors inserted during intermediate loops go largely unnoticed. The model only reliably detects and identifies injected content when it’s inserted in the final loop. Despite the recursive architecture, there’s no evidence of continuous monitoring—the model doesn’t appear to attend to its own representations throughout the process. Whatever integration happens, it happens at the end.

What This Doesn’t Settle

The authors are careful to note the scope of their study. These experiments use a single implementation of Looped Transformers (Ouro), and the observed limitations may not generalize to all recursive architectures. Different training objectives or architectural refinements might yield different results. The paper is explicitly preliminary—a first empirical check on an intuitive hypothesis, not a verdict on the paradigm.

The degradation of probe performance across loops is striking, but its interpretation isn’t obvious. One possibility is that the representations are genuinely losing information. Another is that they’re reorganizing in ways that break the linear separability assumptions of the probes without actually losing content. The paper doesn’t distinguish between these.
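
One follow-up that could start to distinguish these: compare a linear probe against a small nonlinear probe on the same per-loop hidden states. If a nonlinear probe recovers the accuracy the linear one loses, the information is plausibly still present but reorganized; if both decline together, genuine information loss looks more likely. A sketch (ours, not from the paper):

    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    def linear_vs_nonlinear_probe(H, labels, cv=5):
        """Compare a linear probe with a small MLP probe on the same hidden
        states H (n_examples x d_model) and 0/1 correctness labels."""
        linear = LogisticRegression(max_iter=1000)
        mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=2000)
        return {
            'linear': cross_val_score(linear, H, labels, cv=cv).mean(),
            'mlp': cross_val_score(mlp, H, labels, cv=cv).mean(),
        }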

The injection methodology, borrowed from Anthropic's introspection work, also carries assumptions. Detecting an injected concept vector isn't the same as having general introspective access to representations. The model might be learning something about its internal states that simply doesn't manifest in this particular experimental paradigm.

Why This Matters to Us

This paper sits squarely within MPRG’s interests. The gap between internal representation and linguistic output is one of the central puzzles in understanding human-AI interaction. When we interact with a language model, we’re interacting with its outputs—but those outputs may systematically underrepresent what the system has encoded. This has implications for how we interpret model behavior, how we design safety monitoring, and how we think about the nature of machine self-knowledge.

The hope that recursive processing might enable something like introspection—the model learning to read its own states—is intuitive and, if it worked, would have significant implications. The finding that it doesn’t work straightforwardly, at least in this implementation, is worth taking seriously.

The “aligning downward” framing is particularly interesting. If you want to close the gap between representation and output, there are two directions: lift the output to match the representation, or degrade the representation to match the output. The former is what we’d want for safety and capability; the latter is a kind of alignment that loses information rather than expressing it. The observation that looping partly achieves the latter is a useful caution against assuming that architectural innovations automatically yield the improvements we hope for.

We’re also interested in the locality finding—that the model only integrates representational information in the final loop. This suggests that whatever “introspection” might look like in these systems, it isn’t continuous self-monitoring. The model doesn’t seem to be watching its own processing unfold. Whether that’s a limitation of current training, architecture, or something more fundamental remains open.


References

Chen, G., Liu, D., & Shao, J. (2026). Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs? arXiv. https://arxiv.org/abs/2601.10242

Lindsey, J. (2025). Emergent Introspective Awareness in Large Language Models. Transformer Circuits Thread. https://transformer-circuits.pub/2025/introspection/index.html

Zhu, R.-J., Wang, Z., Hua, K., Zhang, T., Li, Z., Que, H., … & Xing, H. (2025). Scaling Latent Reasoning via Looped Language Models. arXiv. https://arxiv.org/abs/2510.25741