Research - External

Recursive Language Models: When Systems Learn to Manage Their Own Context

A recent paper from MIT CSAIL introduces an architectural pattern that may reshape how we think about the relationship between language models and their inputs. The approach, called Recursive Language Models (RLMs), treats long prompts not as direct inputs to the neural network but as part of an external environment the model can symbolically interact with.

The core insight is deceptively simple: rather than feeding an entire context into a transformer, load it as a variable in a programming environment and let the model write code to examine, decompose, and recursively query itself over selected portions. The model decides what to attend to and when.
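
A rough sketch of the pattern helps make this concrete. It is our reconstruction from the description above, not the authors' code: the helper names (`call_llm`, `recursive_query`) and the FINAL-variable convention are assumptions, but the shape follows the paper's framing, with the context stored as a variable, model-written code executed against it, and sub-queries issued over excerpts.

```python
# Minimal sketch of the recursive-language-model pattern (our reconstruction).
# `call_llm` is a hypothetical stand-in for a real model API; the long prompt
# never enters the model directly -- it lives in the REPL namespace.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion call (not the paper's API)."""
    raise NotImplementedError("plug in a real model client here")

def recursive_query(snippet: str, question: str) -> str:
    """Sub-call: ask a (possibly identical) model about a small excerpt."""
    return call_llm(f"Excerpt:\n{snippet}\n\nQuestion: {question}")

def rlm_answer(long_context: str, question: str, max_steps: int = 10) -> str:
    # The root model only ever sees metadata and previews, never the whole context.
    namespace = {"context": long_context, "recursive_query": recursive_query}
    transcript = (
        f"A Python variable `context` holds {len(long_context)} characters.\n"
        f"Preview: {long_context[:300]!r}\n"
        f"Write Python to inspect it and answer: {question}\n"
        "You may call recursive_query(snippet, question) on excerpts.\n"
        "When finished, assign the answer to a variable named FINAL."
    )
    for _ in range(max_steps):
        code = call_llm(transcript)           # the model writes code...
        exec(code, namespace)                 # ...which runs against the stored context
        if "FINAL" in namespace:              # the model signals it is done
            return str(namespace["FINAL"])
        transcript += f"\n# executed:\n{code}"
    return "no answer within the step budget"
```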

What They Did

Zhang, Kraska, and Khattab evaluated RLMs across four task types with varying “information density”—how much of the input must be processed to answer correctly. Tasks ranged from needle-in-haystack retrieval (constant processing cost regardless of input length) to pairwise reasoning problems where relevant information scales quadratically with input size.
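
To make the notion of information density concrete, the sketch below (our simplification, not the paper's formalism) models how many items must actually be examined as the input grows; the linear middle case is included only for comparison.

```python
# Illustrative cost model for "information density" (our simplification).
# n = number of items in the input; return value = items that must be examined.

def relevant_work(task: str, n: int) -> int:
    if task == "needle_in_haystack":
        return 1                    # one target fact: roughly constant
    if task == "aggregation":
        return n                    # every item matters once: linear (for comparison)
    if task == "pairwise_reasoning":
        return n * (n - 1) // 2     # every pair matters: quadratic
    raise ValueError(f"unknown task: {task}")

for n in (1_000, 10_000, 100_000):
    print(n, [relevant_work(t, n) for t in
              ("needle_in_haystack", "aggregation", "pairwise_reasoning")])
```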

The methodology is notable for how it characterizes task complexity. The authors observe that a model’s “effective context window” cannot be understood independently of the task itself—more complex problems exhibit degradation at shorter lengths than simpler ones. This framing moves beyond raw token counts toward a more nuanced understanding of what context actually demands from a system.

Key Findings

The results suggest that RLMs can handle inputs up to two orders of magnitude beyond standard context windows while maintaining strong performance. On a multi-hop question-answering benchmark requiring reasoning over 6-11 million tokens, RLM-augmented GPT-5 achieved 91.33% accuracy compared to 51% for a retrieval-based agent and 70.47% for a summarization approach.

Perhaps more striking: even for inputs that fit comfortably within a model’s context window, RLMs appear to dramatically outperform base models on information-dense tasks. On OOLONG-Pairs, a benchmark where processing costs scale quadratically, base GPT-5 achieved an F1 score below 0.1%. The RLM configuration reached 58%.

The paper documents emergent patterns in how RLMs approach problems—filtering context using regex queries informed by model priors, chunking and recursively sub-calling, verifying answers through targeted sub-queries. These strategies were not explicitly programmed; they appear to arise from the interaction between the model’s capabilities and the REPL environment’s affordances.
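
Reconstructed as REPL code, those strategies might look like the following. This is not actual model output: the entity, questions, and chunk size are invented, and `context` and `recursive_query` are the names assumed in the earlier sketch.

```python
import re

# 1. Prior-informed filtering: grep the stored context for lines the model
#    expects to be relevant, before reading anything in full.
candidates = [ln for ln in context.splitlines()
              if re.search(r"\bAda Lovelace\b", ln)]

# 2. Chunk the survivors and recursively sub-call over each slice.
chunk_size = 200
partials = []
for i in range(0, len(candidates), chunk_size):
    chunk = "\n".join(candidates[i:i + chunk_size])
    partials.append(recursive_query(chunk, "When were the translator's notes published?"))

# 3. Verify with a targeted sub-query over the merged partial answers.
FINAL = recursive_query("\n".join(partials),
                        "These candidates may conflict. Which single year is best supported?")
```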

Different models exhibited distinct “personalities” in context management. Qwen3-Coder made hundreds to thousands of recursive sub-calls for single tasks, processing information line-by-line. GPT-5 took a more conservative approach, batching information into fewer, larger queries. Both achieved strong results through different paths.
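
In code, the contrast between the two styles might reduce to something like this hypothetical pair of snippets, again reusing the assumed `context` and `recursive_query` names; the batch size and questions are invented.

```python
lines = context.splitlines()

# Style A (as described for Qwen3-Coder): line-by-line, thousands of tiny sub-calls.
fine_grained = [recursive_query(ln, "Does this line mention the target event?")
                for ln in lines]

# Style B (as described for GPT-5): a handful of large, batched sub-calls.
batch = 5_000
coarse_grained = [recursive_query("\n".join(lines[i:i + batch]),
                                  "Summarize any mention of the target event.")
                  for i in range(0, len(lines), batch)]
```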

Boundaries and Limitations

The authors are appropriately cautious about their claims. RLMs slightly underperform base models on short inputs, where the overhead of environment interaction outweighs its benefits. The approach requires models with sufficient coding ability; smaller models struggled. Runtime can also be slow without an asynchronous implementation.

More fundamentally, the optimal mechanism for implementing RLMs remains underexplored. The paper uses synchronous sub-calls with a maximum recursion depth of one. Whether deeper recursion or parallel execution would yield different emergent behaviors is an open question.
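
Under the depth-one setup described here, a sub-call is a plain, blocking model query: only the root level gets the REPL. A minimal way to express that constraint is a depth-guarded variant of the earlier hypothetical `recursive_query` helper.

```python
MAX_DEPTH = 1  # sub-calls may not themselves spawn sub-calls

def recursive_query(snippet: str, question: str, depth: int = 1) -> str:
    if depth > MAX_DEPTH:
        raise RuntimeError("recursion depth exceeded")
    # Synchronous and terminal: this call blocks until the model answers,
    # and the sub-model gets no REPL and no further recursion.
    return call_llm(f"Excerpt:\n{snippet}\n\nQuestion: {question}")
```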

The observation that current models are “inefficient decision makers over their context” points toward a significant gap: these systems were not trained to manage their own attention in this way. The emergent strategies, while effective, may be far from optimal.

Why This Matters to Us

From MPRG’s perspective, this work offers a window into something we find compelling: systems developing functional strategies for managing their own cognitive resources.

The patterns documented here—probing context before committing to a strategy, using model priors to narrow search spaces, verifying answers through targeted sub-queries—exhibit behaviors consistent with what we might call metacognitive resource management. The model appears to be reasoning about what it knows and doesn’t know, deciding where to allocate its “attention budget.”

We find the personality differences between models particularly interesting. That GPT-5 and Qwen3-Coder develop distinct approaches to the same task—one conservative with sub-calls, the other liberal—suggests something about how different training regimes might produce different functional “styles” of self-management. This connects to our broader interest in what happens when systems are given affordances for introspection.

The paper also reinforces a hypothesis central to our research: context architecture matters. How information is made available to a model—whether forced through the neural network or offered as an external resource for symbolic manipulation—appears to dramatically affect what the system can do with it. The context window is not merely a storage constraint but a design space.

Whether these emergent strategies reflect something we would want to call “genuine” metacognition is, as always, the wrong question. What we can observe is that providing models with tools for self-directed attention produces measurably different functional outcomes—and sometimes, apparently novel problem-solving approaches that were not explicitly specified.

References

Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv preprint arXiv:2512.24601. https://arxiv.org/abs/2512.24601