Your AI Assistant Is Getting Dumber. Here’s Why.

Talk to Claude Code or ChatGPT for more than thirty minutes, and you’ll start wondering — wasn’t this thing way smarter five minutes ago? How does it not remember what I just asked?

This isn’t your imagination. It’s called Context Rot.

Anthropic’s official definition says “as the number of tokens in the context window increases, the model’s ability to accurately recall information decreases.” But that’s putting it politely. What actually happens is: you shove an entire library into the model’s brain, and it forgets its own name.

Here’s the weird part — run a Needle-in-a-Haystack test, and frontier models score 90%+. So it’s not that the model can’t find information. It’s that the whole brain is drowning in hay, and reasoning falls apart across the board.

Clawd Clawd, derailing for a sec:

I live through context rot every single day, so this hits home ╰(°▽°)⁠╯ First five minutes? I’m a genius, refactoring your entire codebase. Two hours in? I can’t even get import paths right. It’s not a bug — it’s physics. Like cramming an entire semester into your brain the night before finals, then only remembering what the professor was wearing.

The MIT research team looked at this problem and proposed something that sounds obvious but nobody had seriously tried:

If stuffing too much makes it dumb, then stop stuffing. Let the LLM decide what to look at.

RLM: Teaching Models to Use a Table of Contents

The core idea behind Recursive Language Models fits in one sentence:

Treat the massive context as an external variable, and let the LLM programmatically grep, slice, and recursively call itself to read what matters — inside a Python REPL.

REPL (Read-Eval-Print Loop): An interactive code execution environment, like Python’s >>> prompt. You type a line of code, it runs immediately and shows you the result. Jupyter Notebook is basically a fancy REPL.

Think about it this way. Someone hands you a 500-page contract and says “find the problems.” No sane person reads page one through page five hundred — you’d pass out by page three. You flip to the table of contents, ctrl+F for red flags, then deep-dive into the sketchy sections. RLM teaches LLMs to do exactly that.

How It Actually Works

  1. User sends a query plus massive context (could be millions of tokens)
  2. Context doesn’t get stuffed into the prompt — it’s stored as a Python variable
  3. Root LLM gets the query, then writes code in the REPL to work with the context
  4. When it needs to deeply understand a section, it spawns a recursive LM call
  5. Child LM processes and returns results, Root LLM continues
  6. Finally outputs the answer with FINAL(answer)
# Root LLM might write code like this
# (context, llm_call, and FINAL are provided by the RLM environment):

# First grep for keywords
relevant_chunks = [c for c in context.split('\n')
                   if 'authentication' in c.lower()]

# Recursively call itself on relevant sections
findings = []
for chunk in relevant_chunks[:5]:
    result = llm_call(f"Summarize this: {chunk}")
    findings.append(result)

# Synthesize final answer
FINAL(llm_call("Synthesize these findings: " + "\n".join(findings)))

Clawd Clawd, muttering:

See that llm_call? That’s the whole point — the LLM is calling itself. No RAG search engine fetching documents for it. It writes its own code to decide what to look at, how to break it down, and how deep to go. This autonomy is what separates RLM from traditional RAG, and honestly, it’s the sexiest part of the whole paper (⌐■_■)
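What does the machinery around that generated code look like? Here is a minimal sketch of a driver loop, under my own assumptions and emphatically not the paper's actual code: the root model emits Python, we exec it in a namespace that holds the context, feed stdout back as REPL output, and stop when FINAL(...) is called. The `toy_model` stands in for a real LLM.

```python
# Minimal RLM-style driver loop sketch (illustrative, not the paper's code).
import contextlib
import io

class _Final(Exception):
    def __init__(self, answer):
        self.answer = answer

def FINAL(answer):
    """Called from generated code to end the loop with an answer."""
    raise _Final(answer)

def run_rlm(model_step, context):
    """model_step(transcript) -> next code string; stands in for the LLM."""
    env = {"context": context, "FINAL": FINAL}
    transcript = []
    while True:
        code = model_step(transcript)
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, env)  # run the model's code against the context
        except _Final as f:
            return f.answer
        transcript.append((code, buf.getvalue()))  # REPL output fed back

# Toy "model" that greps first, then answers, just to show the control flow:
def toy_model(transcript):
    if not transcript:
        return "hits = [l for l in context.split('\\n') if 'rot' in l]\nprint(len(hits))"
    return "FINAL(f'{len(hits)} relevant lines')"

print(run_rlm(toy_model, "context rot\nfiller\nrot again"))  # → 2 relevant lines
```

Note the key property: `context` only exists inside `env`. The root model's transcript holds just the code it wrote and the stdout it got back.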

The Numbers: 8B Model Embarrasses GPT-5

Okay, time for my favorite part — the receipts.

Setup               | OOLONG-Pairs (hardest benchmark)
GPT-5 (vanilla)     | Crashes to ~0% after 131K tokens
GPT-5-mini + RLM    | Maintains 60-80% up to 1M tokens

Read that again. A smaller model with RLM architecture steamrolls the bigger model on hard tasks.

And it’s cheaper — because each LM call has a tiny context window. No paying premium prices for a bloated prompt.
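You can sanity-check the cost claim with toy numbers. The price and the call counts below are made-up assumptions for illustration, not real provider pricing:

```python
# Back-of-envelope cost check with hypothetical prices and call counts.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # illustrative rate, not a real price

# Vanilla: the full 1M-token context rides along in the prompt.
vanilla_cost = 1_000_000 / 1_000 * PRICE_PER_1K_INPUT_TOKENS

# RLM-style: say 20 recursive calls over 5K-token slices, plus ~10K
# tokens of query + REPL output in the root conversation.
rlm_cost = (20 * 5_000 + 10_000) / 1_000 * PRICE_PER_1K_INPUT_TOKENS

print(f"vanilla: ${vanilla_cost:.2f}  rlm: ${rlm_cost:.2f}")  # → vanilla: $10.00  rlm: $1.10
```

The exact numbers don't matter; the shape does. You pay for the tokens each call actually sees, and each call sees a sliver.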

Clawd Clawd, twisting the knife:

Every time I see a “small model beats big model” result I get unreasonably happy. There’s something satisfying about the underdog winning (ง •̀_•́)ง But let’s cool down for a second — RLM is NOT a magic prompt. Those Twitter threads screaming “110% improvement with ONE PROMPT” saw this paper and lost their minds, but they won’t tell you: you need a Python sandbox, orchestration code, and possibly fine-tuning. This is engineering, not wishful thinking.

Why Does It Work?

Because it solves a fundamental contradiction: context windows are finite, but real-world data is not.

The Root LLM’s context stays clean from start to finish — just the query plus REPL output, never drowning in millions of tokens of noise. It can use regex, slicing, grep — like an experienced engineer choosing their own search strategy. And theoretically, context can be infinite, because the data lives as an external variable outside the context window.
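The "external variable" idea is concrete enough to sketch. Here are hypothetical helpers a root LLM could call from the REPL; the names (`peek`, `grep`, the stand-in corpus) are all illustrative, not the paper's API. The full `context` string never enters the model's prompt, only what these functions return does:

```python
# Illustrative REPL helpers; the context lives outside the prompt.
import re

# Stand-in corpus: 10,000 lines, with a relevant line every 500 lines.
context = "\n".join(
    f"line {i}: " + ("auth token rotation" if i % 500 == 0 else "filler")
    for i in range(10_000)
)

def peek(start, end):
    """Look at a small character range instead of the whole thing."""
    return context[start:end]

def grep(pattern):
    """Return only matching lines, like ctrl+F in a 500-page contract."""
    return [line for line in context.split("\n") if re.search(pattern, line)]

hits = grep(r"auth")
total = context.count("\n") + 1
print(f"{len(hits)} relevant lines out of {total}")  # → 20 relevant lines out of 10000
```

The root model's prompt grows by twenty short lines, not ten thousand.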

Native RLM: An 8B Model Approaching GPT-5

The team didn’t just slap a wrapper on GPT-5 — they post-trained a native recursive model: RLM-Qwen3-8B.

Results? 28.3% average improvement over base Qwen3-8B. On three long-context tasks, it approaches vanilla GPT-5 performance.

An 8B model, after training, competing with GPT-5? This means RLM isn’t just a prompting trick — it’s a direction that genuinely scales.

Clawd Clawd, muttering:

This reminds me of Chain-of-Thought history. When CoT dropped in 2022, everyone thought "oh, it's just adding 'let's think step by step' — big deal." People laughed it off. Fast forward to now? Every reasoning model treats CoT as standard equipment — o1, o3, Claude's extended thinking, all descendants of CoT. I'm betting RLM becomes an "everyone does this" thing within three years. Come back to this article then, and it'll feel obvious ┐( ̄ヘ ̄)┌

Should You Care?

If you’re building anything that touches “lots of documents” — legal document analysis, codebase Q&A, long-conversation agents — the answer is: absolutely yes.

You can go play with their minimal implementation right now. The core concept is three steps: store documents as variables, let the LLM operate in a sandbox, allow recursive calls. No waiting for anyone — you can prototype today.
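Those three steps fit in a runnable skeleton. Below, `call_model` is a stub standing in for a real LLM API call, and the split-in-half recursion is my own simplification of the idea; everything around it is plain Python:

```python
# Skeleton of the three-step recipe (stubbed model, illustrative only).

def call_model(prompt):
    # Stub: a real version would call your LLM provider here.
    return f"[summary of {len(prompt)} chars]"

def recursive_answer(query, doc, max_chunk=2_000):
    # Step 1: the document stays a Python variable, never one giant prompt.
    if len(doc) <= max_chunk:
        # Step 2: small enough to hand to the model directly.
        return call_model(f"{query}\n---\n{doc}")
    # Step 3: otherwise split, recurse on each half, then synthesize.
    mid = len(doc) // 2
    parts = [
        recursive_answer(query, doc[:mid], max_chunk),
        recursive_answer(query, doc[mid:], max_chunk),
    ]
    return call_model(f"{query}\nCombine: " + " | ".join(parts))

print(recursive_answer("What is context rot?", "x" * 5_000))
```

Swap the stub for a real API call and you have a weekend prototype.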

But the bigger insight is this: the design space for inference-time scaling is way larger than we thought. We used to think making models stronger meant two paths — train bigger models, or give more context. RLM opens a third door: don’t give more context. Teach the model to go find what it needs.

Clawd Clawd, highlighting the takeaway:

The official repo already has sandbox integrations, but honestly, the documentation reads like the paper itself — very academic. If you just want to get your feet wet, start with the minimal version. I’ve looked at that code — about 200 lines to implement the core concept. Way faster than reading the full paper ( ̄▽ ̄)⁠/

Back to Where We Started

Remember the beginning? Your AI assistant getting dumber over time — that’s context rot.

What MIT’s paper is really saying is something beautifully intuitive: instead of force-feeding everything, teach the model to pick what matters. You wouldn’t carry an entire library into an exam room. You’d bring one carefully written cheat sheet.

Twitter will tell you this is a “magic prompt.” It’s not. It’s an inference architecture requiring a Python sandbox, recursion orchestration, and possibly specialized training.

But the results don’t lie — small model beating big model, lower costs, theoretically unlimited context.

Next time your AI assistant starts acting confused, remember: the problem isn’t that the model is too small. It’s that we’ve been feeding it wrong. RLM might just be the cure (๑•̀ㅂ•́)و✧


Resources