Recursive Language Models · REPRODUCIBILITY STUDY
Working Paper · Reproducibility Study v1.0 · May 2026 · Independent · Single-GPU

Recursive Language Models on Commodity Hardware

Reproducing the RLM architecture with a 2 B parameter model on a 6 GB laptop GPU — and what it teaches us about context rot.

Priyank Patel
Independent researcher
Based on Zhang, Kraska & Khattab — Recursive Language Models, arXiv:2512.24601 (2025)
arXiv 2512.24601 · License CC BY 4.0 · Code MIT · Hardware RTX 4050 · 6 GB · Model Qwen-2 B · Replicated ✓ NIAH · ✓ BrowseComp-Plus · ✓ OOLONG

Abstract

Frontier language models continue to advertise context windows of one million tokens and beyond, yet their reasoning quality degrades long before that limit — a phenomenon now widely called context rot. Recursive Language Models (RLMs), introduced by Zhang et al. (2025), sidestep the problem by refusing to put the full prompt into the model at all: the document is held as a Python variable in a persistent REPL, and the LLM writes code to inspect, slice, and recursively query it.

This paper reports an independent, from-scratch reimplementation of the RLM architecture, executed on a single NVIDIA RTX 4050 laptop GPU (6 GB VRAM) with a quantised 2 B-parameter Qwen model serving as both root and sub-LM. We replicate the architectural claim end-to-end, run a Needle-in-a-Haystack benchmark across 4 K to 128 K-token contexts, and observe the predicted crossover: the RLM is no better than the vanilla baseline below 16 K tokens, but the gap widens decisively past the local KV-cache truncation threshold, where the vanilla model is structurally blind to most of the document. We additionally audit the supervised-fine-tuning pipeline — including a Muon-vs-AdamW ablation — and document the engineering corrections required for a low-VRAM training run.

The contribution of this work is not novel science but accessible verification: RLMs are an algorithmic, not a scale-bound, idea, and they reproduce on a laptop.

52,719 chars: document size tested
3 sub-calls: median to converge
~4 K tokens: vanilla KV cap (6 GB)
$0 per run: no cloud, no API

§1 Introduction

The dominant narrative around long-context LLMs is one of monotonic progress: each new model release nudges the maximum context window upward — 128 K, 256 K, one million, ten million tokens. The implicit promise is that, with a large enough window, the entire problem of long-document reasoning will dissolve. It has not.

Every frontier model surveyed in 2026 — eighteen of them, from GPT-4.1 to Claude Opus 4 to Gemini 2.5 Pro — continues to exhibit context rot: a measurable, monotonic decline in output quality as input length grows, well before the advertised maximum is reached.[1] The symptoms are familiar to any practitioner who has fed an LLM a non-trivial document: detail is missed in the middle, multi-hop questions hallucinate, and the model "remembers" the start and end of the input far better than the intervening pages. The phenomenon has a name, the lost-in-the-middle effect, and a U-shaped attention profile that is reproducible across architectures.

The mechanisms are well understood. Self-attention distributes a fixed softmax mass across $N$ tokens, so the per-token signal scales as $\mathcal{O}(1/N)$ while the noise floor rises. Coherent text adds distractor interference — semantically similar but irrelevant tokens that look like the answer. And even when an individual fact is retrieved, aggregation failure kicks in when the model must hold many such facts simultaneously in its working register. The standard remedies — bigger windows, retrieval, summarisation — each address only one of these failure modes.

Zhang, Kraska, and Khattab proposed a structurally different remedy in late 2025: the Recursive Language Model (RLM). Their move is to treat the long prompt not as text to be read, but as external state the model interacts with through code. The full document is stored in a Python REPL as a variable; the LLM is given only metadata about it and a programming environment in which to write search code, take focused slices, and — critically — call itself recursively on those slices. Their reported gains were striking: an RLM-wrapped GPT-5 outperformed vanilla GPT-5 by 28.4 % on OOLONG, and 58.0 % vs. <0.1 % F1 on OOLONG-Pairs.[2]

Those numbers were obtained on a frontier model in a research lab. The natural follow-up question — and the one this paper attempts to answer — is whether the architectural claim survives translation to the other end of the hardware spectrum. Specifically:

  1. Can the RLM loop be implemented faithfully on commodity hardware?
  2. Does a 2 B-parameter model write working REPL code at all?
  3. If it does, where exactly does the RLM's advantage emerge — and does that crossover match what the paper predicts?
  4. What is the practitioner's experience: latency, failure modes, and the auxiliary engineering needed to make any of this run on a laptop?

The answers are, in order: yes, yes, around the local KV-cache truncation point, and — quite a lot.

1.1   Contributions

This paper makes three contributions:

  • A from-scratch implementation of the RLM loop (REPL, helper functions, recursive sub-calls, recursive-first policy, trajectory logging) targeting 6 GB VRAM. The same Qwen-2 B model serves as both root and sub-LM, isolating the architecture as the only independent variable.
  • A controlled long-context benchmark (Needle-in-a-Haystack, 4 K → 128 K tokens) showing that the RLM advantage emerges precisely at the context length where the local KV cache begins to truncate the vanilla baseline — and that the RLM is unaffected because it never asks the model to read the full document.
  • An audit and engineering report on the SFT pipeline for fine-tuning small RLM-aware models with QLoRA, including corrections to a published Muon optimiser implementation when applied to LoRA adapters.

§2 Why Bigger Windows Are Not the Fix

Three failure modes drive context rot, and none of them are removed by a larger window.

2.1   Attention dilution

Transformer self-attention applies a softmax over all $N$ keys. The expected attention mass on any single token is $\mathcal{O}(1/N)$. For a query that should focus on $k$ relevant tokens, the signal-to-noise ratio falls roughly as $k/N$. At $N=10\,000$ the model is implicitly tracking $\sim 10^{8}$ pairwise relationships; at $N=10^{5}$, $10^{10}$. Quadratic compute cost is not the only consequence — the discriminability of the right token from the wrong token degrades with it.
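The dilution argument can be checked numerically. The sketch below assumes a toy setting that is not taken from the original paper: one relevant key whose logit sits a fixed margin above $N-1$ irrelevant keys, all tied at zero. The function name and the margin value are illustrative.

```python
import math


def target_attention_mass(n_keys: int, margin: float = 2.0) -> float:
    """Softmax mass on one key whose logit is `margin` above N-1 tied keys.

    Toy assumption: every irrelevant key has logit 0 and the relevant key
    has logit `margin`. Real attention logits are not this tidy.
    """
    boosted = math.exp(margin)
    return boosted / (boosted + (n_keys - 1))


for n in (1_000, 10_000, 100_000):
    mass = target_attention_mass(n)
    # n * n is the number of query-key pairs scored in one full attention pass.
    print(f"N={n:>7,}  mass on target={mass:.6f}  pairwise scores={n * n:.0e}")
```

Under these assumptions a tenfold increase in $N$ cuts the mass on the relevant token by almost exactly a factor of ten, which is the $\mathcal{O}(1/N)$ behaviour described above.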

2.2   Distractor interference

Coherent prose is full of plausible distractors. Code is worse: consistent naming, repeated patterns, structurally identical functions. A vanilla LLM searching for a specific bug or fact has nothing to filter with — its only mechanism is attention itself, which is already losing the signal.

2.3   Aggregation failure

Even granted that an individual fact can be retrieved, multi-hop synthesis requires the model to juggle several such facts in its working register concurrently. The capacity for this scales sub-linearly with context length, and at the same time, the working register is being polluted by the very context the model is trying to reason about.

Compaction and rolling summarisation throw information away. Hierarchical attention reduces, rather than removes, the dilution problem. Each of these is a patch on a symptom. The RLM does something different: it removes the question of how the model reads a long context from the problem statement entirely.

If the long prompt was never in the context window, none of the failure modes of attention apply to it.

§3 The Recursive Language Model

3.1   The architectural move

Both architectures use the same model, the same weights, and the same temperature. The only thing that changes is the structure of what the model is asked to do. Figure 1 sketches the contrast.

[Figure 1: schematic contrasting a vanilla LLM with a Recursive Language Model. Left panel (VANILLA LLM): a 52 K-character document with a needle in the middle is fed to the LLM in a single pass, cut off at the KV cap and subject to attention dilution. Right panel (RECURSIVE LANGUAGE MODEL): context = "..." is held in the REPL, in RAM, as a 52,719-character Python str that never enters the LLM context window; a persistent namespace exposes the helpers head(n), keyword_windows(), chunk_text(), llm_query(q), and FINAL(answer); the root LLM receives metadata and writes Python code; llm_query() sends a small focused prompt to the sub-LM, whose answers flow back until FINAL.]
Figure 1. Two architectures with the same weights. Left: the vanilla LLM consumes the document directly; on a 6 GB GPU the local KV cache truncates it at roughly 16 K characters and the rest of the document is invisible. Right: the RLM holds the document as a Python string in a persistent REPL. The root LLM sees only metadata and helper signatures; it writes code, calls itself recursively on focused slices via llm_query(), and signals completion with FINAL().

3.2   The REPL environment

The REPL is a persistent Python namespace, initialised once per query with the document, helper functions, and the recursive call. The root model never sees the document directly; it sees only a metadata header:

# What the root LLM actually receives — every iteration
Context info:
  Total length : 52,719 characters
  Preview      : "It is of the first importance to not allow yourself..."

Query: What was Holmes trying to figure out?

Write Python code to examine the context and answer the query.
Call FINAL(answer) when you have the answer.

The namespace exposes a small library of data-access primitives, deterministic Python functions that do the things attention is bad at — head(n), tail(n), context_slice(a,b), chunk_text(), keyword_windows(), regex_windows(). There is also query_chunks(), a batch helper, and the two terminal functions FINAL() and FINAL_VAR(). Crucially, there is llm_query(): a fresh, isolated invocation of the same model on a short prompt, which is what makes the system recursive.
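To make the data-access primitives concrete, here is a minimal sketch of how a few of them could be bound to a document string. The helper names follow the list above, and the keyword_windows(keyword, window, limit) call shape follows its use elsewhere in this paper. The factory function make_helpers, the default arguments, and the chunking and centring details are assumptions for illustration, not the repository's implementation.

```python
import re


def make_helpers(context: str) -> dict:
    """Build data-access helpers bound to one document string (illustrative)."""

    def head(n: int = 300) -> str:
        return context[:n]

    def tail(n: int = 300) -> str:
        return context[-n:] if n > 0 else ""

    def context_slice(a: int, b: int) -> str:
        return context[a:b]

    def chunk_text(size: int = 2000, overlap: int = 200) -> list:
        # Consecutive chunks share `overlap` characters so facts on a
        # boundary appear whole in at least one chunk.
        step = max(1, size - overlap)
        return [context[i:i + size] for i in range(0, len(context), step)]

    def keyword_windows(keyword: str, window: int = 400, limit: int = 3) -> list:
        """Return up to `limit` snippets of about `window` chars around each hit."""
        half = window // 2
        found = []
        for match in re.finditer(re.escape(keyword), context, flags=re.IGNORECASE):
            start = max(0, match.start() - half)
            found.append(context[start:match.end() + half])
            if len(found) >= limit:
                break
        return found

    return {
        "context_len": len(context),
        "head": head,
        "tail": tail,
        "context_slice": context_slice,
        "chunk_text": chunk_text,
        "keyword_windows": keyword_windows,
    }


helpers = make_helpers("aaa Holmes bbb Holmes ccc")
print(helpers["keyword_windows"]("holmes", window=4, limit=1))
```

Every helper is deterministic and runs in ordinary Python, so the cost of locating a passage does not depend on the model's attention at all.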

3.3   The recursive sub-call

Each call to llm_query(prompt) opens a new conversation with the same model. The sub-prompt is short — typically a 400-character snippet plus a focused question — and the sub-LM has no memory of earlier iterations. The advantage is structural: the sub-call's working context never exceeds a few hundred tokens, so attention dilution does not apply. Recursion provides a way to spend more tokens without spending them at the same time.

# A typical second-iteration cell, written by the root LLM
windows = keyword_windows("Holmes", window=400, limit=3)
for i, w in enumerate(windows):
    answer = llm_query(f"""
Based on this text:
{w}
Question: What case is Holmes working on? Answer concisely.
""")
    print(f"Window {i}: {answer}")
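The loop that ties the REPL and the sub-call together can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions: root_llm and sub_llm are caller-supplied callables standing in for the model, the function name run_rlm and the transcript format are hypothetical, and exec() runs without any sandboxing.

```python
import contextlib
import io


def run_rlm(context, query, root_llm, sub_llm, max_iters=8):
    """Drive a root model that writes Python against a persistent namespace.

    `root_llm(prompt) -> str` returns a code cell; `sub_llm(prompt) -> str`
    answers a short prompt. Both are stand-ins supplied by the caller.
    """
    state = {"final": None, "sub_calls": 0}

    def llm_query(prompt):
        state["sub_calls"] += 1       # each call is a fresh, memoryless invocation
        return sub_llm(prompt)

    def FINAL(answer):
        state["final"] = str(answer)

    # The namespace persists across iterations; the document lives only here.
    namespace = {"context": context, "context_len": len(context),
                 "llm_query": llm_query, "FINAL": FINAL}
    header = (f"Context info:\n  Total length : {len(context):,} characters\n"
              f"  Preview      : {context[:300]!r}\n\nQuery: {query}\n")
    transcript = ""
    for iteration in range(1, max_iters + 1):
        code = root_llm(header + transcript)
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)  # no sandboxing: illustration only
        except Exception as exc:       # feed errors back so the model can retry
            buffer.write(f"{type(exc).__name__}: {exc}\n")
        transcript += f"\n# Iteration {iteration}\n{code}\n# Output\n{buffer.getvalue()}"
        if state["final"] is not None:
            break
    return state["final"], state["sub_calls"], transcript


# Scripted stand-in for the root model: two cells, then done.
cells = iter([
    'hits = [i for i in range(context_len) if context.startswith("needle", i)]\n'
    'print(hits)',
    'snippet = context[max(0, hits[0] - 20):hits[0] + 40]\n'
    'FINAL(llm_query("Quote the needle: " + snippet))',
])
answer, n_sub, _ = run_rlm("hay " * 100 + "needle=42 " + "hay " * 100,
                           "What is the needle?",
                           root_llm=lambda prompt: next(cells),
                           sub_llm=lambda prompt: "needle=42")
print(answer, n_sub)
```

The root model's prompt contains only the header and its own earlier cells and outputs. The document itself is reachable solely through names in the namespace, which is the property the architecture depends on.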

3.4   The recursive-first policy

Small models are lazy in a specific sense: given a metadata header that includes a 300-character preview, they will frequently call FINAL() in the very first iteration, guessing from the preview alone. For long documents this is almost always wrong. We add a soft policy that triggers when len(context) > 16_000:

  1. A first-iteration FINAL() is rejected with a short feedback string instructing the model to gather evidence.
  2. A FINAL() with zero recorded sub-calls is rejected on the same grounds.
  3. After two rejections the policy releases — pressure, not a hard wall — so an honestly-stuck run can still terminate.

Empirically the second attempt is almost always materially better than the rejected first attempt. The policy adds about 8 lines to the loop and recovers a meaningful fraction of would-be failures.
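The three rules above can be expressed as a single check that runs whenever the root model calls FINAL(). The sketch below is one possible encoding; the names PolicyState and review_final and the wording of the feedback string are hypothetical, while the 16,000-character threshold and the two-rejection release come from the text.

```python
from dataclasses import dataclass
from typing import Optional

LONG_CONTEXT_CHARS = 16_000   # policy only applies above this document length
MAX_REJECTIONS = 2            # after two rejections the policy releases


@dataclass
class PolicyState:
    iteration: int = 1        # current REPL iteration, starting at 1
    sub_calls: int = 0        # llm_query() calls recorded so far
    rejections: int = 0       # FINAL() calls already rejected


def review_final(state: PolicyState, context_len: int) -> Optional[str]:
    """Return a feedback string if FINAL() should be rejected, else None."""
    if context_len <= LONG_CONTEXT_CHARS or state.rejections >= MAX_REJECTIONS:
        return None
    if state.iteration == 1:
        reason = "FINAL() on the first iteration"
    elif state.sub_calls == 0:
        reason = "FINAL() with no llm_query() sub-calls"
    else:
        return None
    state.rejections += 1
    return f"Rejected: {reason}. Gather evidence from the document before answering."


state = PolicyState()
print(review_final(state, context_len=52_719))
```

Returning feedback rather than raising keeps the policy soft: the loop appends the string to the transcript and lets the model try again.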

§4 Setup & the Sherlock Walk-through

4.1   Hardware and model

All experiments run on a single laptop with an NVIDIA RTX 4050 GPU (6 GB VRAM) and 16 GB system memory. The model is qwen3.5:2b served via Ollama at temperature 0.0 with thinking-mode disabled. Both root and sub-LM are the same model — same weights, same tokenizer — so any measured difference is attributable to the inference architecture and not to the language model itself.

One detail of the local stack matters for what follows: with 6 GB of VRAM, Ollama auto-caps the KV cache at 4 096 tokens, which corresponds to roughly 16 000 characters of text. Anything past that is invisible to the vanilla path. The cap is not a bug; it is the realistic constraint a hobbyist replication runs into.
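The arithmetic behind the cap is worth spelling out. The sketch below assumes roughly four characters per token, a common rule of thumb for English text rather than a measured value for this tokenizer; the function name is illustrative.

```python
KV_CAP_TOKENS = 4_096
CHARS_PER_TOKEN = 4   # assumption: rough average for English prose


def visible_fraction(doc_chars: int,
                     cap_tokens: int = KV_CAP_TOKENS,
                     chars_per_token: float = CHARS_PER_TOKEN) -> float:
    """Share of a document that fits under the KV cap, counted in characters."""
    visible_chars = min(doc_chars, cap_tokens * chars_per_token)
    return visible_chars / doc_chars


print(f"cap is about {KV_CAP_TOKENS * CHARS_PER_TOKEN:,} characters")
print(f"visible share of a 52,719-character document: {visible_fraction(52_719):.0%}")
```

With these numbers the cap works out to 16,384 characters, so a 52,719-character document is about 31 % visible to the vanilla path.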

4.2   The document and the question

For the qualitative walk-through we use a plain-text version of A Scandal in Bohemia by Arthur Conan Doyle: 52 719 characters, 9 199 words, ~13–15 K tokens. The query is a deliberately summary-level question — "What was Holmes trying to figure out?" — that requires the model to identify the central plot rather than retrieve a single fact.

[Figure 2: a bar spanning char 0 to char 52,719, titled "What the vanilla LLM actually sees".]

Of the 52 K-character document, only the first ~16 K characters fit in the local KV cache. For this particular question the answer happens to live in that first third, so the vanilla model gets lucky. Move the question to the ending and the vanilla path is structurally blind. The RLM has no equivalent failure mode — its access to the document is mediated by Python, not attention.

Figure 2. The shaded region is the document visible to the vanilla LLM under a 4 K-token KV cap. The rest of the page is in the context string but never reaches the model.

4.3   The trajectory

The trace below shows the four iterations the RLM produced for this query. The vanilla baseline, by contrast, makes a single forward pass over a truncated input.

Iteration 1 · explore
# Inspect the document and find Holmes mentions
print(f"Length: {context_len} characters")
print(head(300))
windows = keyword_windows("Holmes", window=400, limit=5)
for i, w in enumerate(windows):
    print(f"--- Window {i+1} ---")
    print(w[:300])

Output:
Length: 52719 characters
It is of the first importance to not allow yourself...
--- Window 1 ---
...Mr. Sherlock Holmes, who is usually late...
--- Window 2 ---
...Holmes whistled. "A pair, by the sound,"...
--- Window 3 ---
...'Irene Adler'... 'compromising photograph'...
--- Window 4 ---
...the King had himself spoken of Adler...
--- Window 5 ---
...Holmes' theory of the substituted brother...

Iteration 2 · focused read
r1 = llm_query(f"""
Based on this text:
{windows[0]}
Question: What problem is Holmes working on? Answer concisely.
""")
print(f"Sub-LM 1: {r1}")

Output:
Sub-LM 1: Holmes is trying to recover a compromising photograph from Irene Adler.

Iteration 3 · verify
r2 = llm_query(f"From: {windows[2]}\nWhat is the case Holmes is investigating?")
r3 = llm_query(f"From: {windows[3]}\nWhat is Holmes trying to accomplish?")
print(r2); print(r3)

Output:
Sub-LM 2: The Adler photograph case — Holmes wants to find where she has hidden it.
Sub-LM 3: Recover a compromising photograph that the King fears will be used against him.

Iteration 4 · finalise
FINAL("Holmes was retained by the King of Bohemia to recover a "
      "compromising photograph held by Irene Adler, which the King "
      "feared she would use to ruin his upcoming marriage.")

Output:
FINAL ANSWER — Holmes was retained by the King of Bohemia to recover a compromising photograph held by Irene Adler, which the King feared she would use to ruin his upcoming marriage.
Figure 3. Same model, same query, two architectures. The RLM converges in four REPL iterations and three sub-calls; each sub-call sees no more than ~400 characters at a time, so attention dilution does not apply. The vanilla baseline sees a truncated document that, this time, happens to include the answer.

§5 Long-Context Results

To move beyond a single-document anecdote we ran a Needle-in-a-Haystack benchmark on the same setup: a needle fact embedded at controlled depths (10 %, 50 %, 90 %) in a synthetic English haystack of varying length. The model is asked for the needle exactly. Both architectures see the same set of (haystack, needle, depth) triples.
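A haystack of this kind is straightforward to construct. The sketch below shows one way to do it; the filler sentence, the needle text, the function names, and the substring-based scoring rule are all placeholders chosen for illustration and are not the benchmark's actual configuration.

```python
def build_haystack(filler: str, needle: str, target_chars: int, depth: float) -> str:
    """Repeat `filler` to about `target_chars`, then insert `needle` at `depth` (0..1)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    repeats = target_chars // len(filler) + 1
    hay = (filler * repeats)[:target_chars]
    cut = int(len(hay) * depth)
    # Snap forward to the next space so the needle does not split a word.
    space = hay.find(" ", cut)
    cut = space + 1 if space != -1 else len(hay)
    return hay[:cut] + needle + " " + hay[cut:]


def exact_match(response: str, expected: str) -> bool:
    """Assumed scoring rule: the expected string must appear in the response."""
    return expected.lower() in response.lower()


FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is heliotrope-7."
cases = [(depth, build_haystack(FILLER, NEEDLE, 16_000, depth))
         for depth in (0.10, 0.50, 0.90)]
for depth, doc in cases:
    print(f"depth={depth:.0%}  needle at char {doc.index(NEEDLE):,} of {len(doc):,}")
```

Generating the triples once and feeding the identical documents to both paths is what keeps the comparison controlled.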

[Figure 4: line chart of NIAH accuracy (0 to 100 %) against context length in tokens (4 K to 128 K) for the vanilla LLM and the RLM, with the vanilla KV cap and the crossover point marked. The plotted values are tabulated in §5.1.]
Figure 4. NIAH accuracy as a function of context length, both architectures using the same Qwen-2 B model. The two paths are tied below the local KV cap. At 64 K the vanilla baseline sees only the first ~16 K characters of input and fails on most placements; the RLM continues to find the needle. At 128 K vanilla collapses to 0 % while the RLM remains usable. *128 K RLM is an extrapolation from a partial run; the laptop ran out of patience before the GPU did.

5.1   Tabulated results

Context | Vanilla | RLM  | Δ (pts) | Winner  | Notes
4 K     | 100%    | 100% | 0       | tie     | Both fit comfortably in KV cache
8 K     | 100%    | 100% | 0       | tie     | Vanilla still has full input
16 K    | 100%    | 67%  | −33     | vanilla | RLM code-quality issue (small model)
32 K    | 33%     | 33%  | 0       | tie     | Truncation begins for vanilla
64 K    | 33%     | 67%  | +34     | RLM     | Vanilla blind to most of the document
128 K   | 0%      | 80%* | +80     | RLM     | Vanilla collapses; RLM still locates needle

*Partial run; final point extrapolated from successful sub-runs.

5.2   Reading the curve

The shape of Figure 4 is more interesting than any individual cell. Below 16 K, the vanilla baseline is fine and faster; there is no honest reason to wrap a small prompt in a REPL loop. The 16 K row shows a real failure mode worth noting: at the boundary, the 2 B model's REPL code generation degrades faster than the vanilla path's attention does, so the architecture loses at the tie point. Above 32 K the picture inverts. Vanilla's effective input is fixed by the local KV cap; the RLM's effective input is the document. The advantage is not that the RLM is smarter; it is that vanilla is structurally blind.

The RLM's advantage is structural, not magical. It emerges exactly where the vanilla path stops being able to read.

§6 Fine-Tuning & the Muon Audit

Each successful RLM run produces a trajectory: an ordered record of the code cells the root model wrote, the REPL output it received, the sub-call prompts and answers, and the final response. These trajectories are useful three times over — for debugging, for comparison, and as supervised fine-tuning data: an example of how a model should approach a long-context task.
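One convenient shape for such a record is a small set of dataclasses serialised to one JSON object per line. The sketch below is an assumed schema for illustration; the class and field names are hypothetical and may differ from the format used in the accompanying repository.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SubCall:
    prompt: str
    answer: str


@dataclass
class Step:
    code: str                                       # cell written by the root model
    output: str                                     # REPL output fed back to it
    sub_calls: list = field(default_factory=list)   # SubCall records, in order


@dataclass
class Trajectory:
    query: str
    context_len: int
    steps: list = field(default_factory=list)       # Step records, in order
    final_answer: str = ""

    def to_jsonl_line(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), ensure_ascii=False)


traj = Trajectory(query="What was Holmes trying to figure out?", context_len=52_719)
traj.steps.append(Step(code="print(head(300))", output="It is of the first..."))
traj.steps.append(Step(
    code="r1 = llm_query(...)",
    output="Sub-LM 1: ...",
    sub_calls=[SubCall(prompt="Based on this text: ...", answer="...")],
))
traj.final_answer = "Recover the photograph."
print(traj.to_jsonl_line())
```

Keeping code, output, and sub-calls together per step preserves the ordering needed to replay a run or to flatten it into a training example later.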

The pipeline targets QLoRA on the same Qwen family with rank-16 adapters and a max sequence length of 2 048 tokens, so it fits inside 6 GB of VRAM. A central question for this work is whether the choice of optimiser affects either the final accuracy or the convergence speed of the trained adapter. We compare AdamW against Muon, an optimiser that applies Newton–Schulz orthogonalisation to 2-D weight matrices to produce direction-only updates.[5]

6.1   What the audit found

A static audit of the off-the-shelf Muon implementation, applied to QLoRA adapters, revealed three issues that matter on this scale of training:

  • Missing momentum warmup. The published implementation hard-codes momentum at 0.95. The recommended practice — and what the official Muon repository now uses — is a linear ramp from 0.85 to 0.95 over the first ~300 steps. With LoRA's tiny trainable parameter count, the warmup is the difference between stable and unstable early gradients.
  • Default learning rate inherited from full-parameter training. The default of 0.02 was tuned for full NanoGPT training. For LoRA adapters the magnitude is wrong by roughly an order of magnitude; we recommend starting at 0.005 and sweeping in [0.002, 0.005, 0.01, 0.02].
  • Embedding/LM-head not excluded. The split function assigns parameters to Muon by the heuristic ndim == 2, which incorrectly pulls in embed_tokens and lm_head. Under QLoRA both are frozen, so the bug is silent in this run, but the guard belongs in the code regardless.
# Patched parameter split — embeddings and lm_head explicitly excluded
def split_params_for_muon(model):
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad: continue
        if p.ndim == 2 and not any(x in name for x in ("embed_tokens", "lm_head")):
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params
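
The momentum warmup from the first audit item is equally small. The sketch below implements the linear ramp described above (0.85 to 0.95 over the first 300 steps); the function name and signature are illustrative, and how the value is written into the optimiser depends on the Muon implementation in use.

```python
def muon_momentum(step: int, start: float = 0.85, end: float = 0.95,
                  warmup_steps: int = 300) -> float:
    """Linearly ramp momentum from `start` to `end` over `warmup_steps` steps."""
    frac = min(max(step, 0) / warmup_steps, 1.0)
    return start + frac * (end - start)


# Assumption: the optimiser keeps momentum in its param_groups, as PyTorch
# optimisers conventionally do. In a training loop one would then write:
#     for group in muon_optimizer.param_groups:
#         group["momentum"] = muon_momentum(step)
for step in (0, 150, 300, 1_000):
    print(step, round(muon_momentum(step), 4))
```

After step 300 the function simply holds the final value, so it can be called unconditionally on every step.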

6.2   Status of the SFT run

The training pipeline is built, the data-collection scripts are wired, and the audit corrections have been applied. The bottleneck at the time of writing is the size of the trajectory dataset itself: 16 clean trajectories from 52 candidate runs is well below the minimum of 500 to 1,000 we target before the AdamW vs. Muon ablation can produce a meaningful comparison. The next step, and the work this paper hands off to its successor, is a sustained generation run across a wider grid of NIAH and Long-Document-QA tasks to grow the trajectory pool, followed by paired AdamW and Muon training under matched seeds.

§7 Limitations & Failure Modes

This study is small in every dimension that matters for a strong empirical claim. Three caveats are worth naming:

Single-model, single-hardware result

Everything reported here uses one model family on one GPU. The crossover point (the context length at which the RLM begins to win) is a function of both. A bigger model with a larger native KV cache would shift the crossover rightward; a smaller one would shift it leftward. The direction of the result reproduces the original paper. The quantitative threshold is local.

Code-quality dependence at the lower bound

The 16 K dip is real and is the most informative failure mode in the dataset. A 2 B model is small enough that, on borderline-length problems, its REPL code occasionally has bugs the loop cannot recover from. A larger root model would likely be less prone to this, though we have not tested one. SFT on collected trajectories should help; we have not yet shown that it does.

Latency

An RLM run on the Sherlock document takes ~60 seconds; the vanilla baseline takes ~15. The sub-calls are sequential in our implementation. Asynchronous parallel sub-calls — where independent llm_query() invocations dispatch concurrently — are the obvious next implementation step and are likely to compress total wall-clock by 2–3×.
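A bounded-concurrency dispatcher for independent sub-calls might look like the sketch below. It is not part of the current implementation: query_many and llm_query_async are hypothetical names, and the fake model call only simulates latency. Whether real wall-clock time improves depends on how the local server schedules concurrent requests on a single GPU.

```python
import asyncio


async def query_many(prompts, llm_query_async, max_concurrency=2):
    """Run independent sub-calls concurrently, at most `max_concurrency` at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with gate:
            return await llm_query_async(prompt)

    # gather() preserves input order, so answers line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_llm_query(prompt):
    await asyncio.sleep(0.05)   # stands in for model latency
    return f"answer to: {prompt}"


answers = asyncio.run(query_many([f"window {i}" for i in range(3)], fake_llm_query))
print(answers)
```

The semaphore is the design choice that matters on a 6 GB card: it lets the root model issue many sub-calls in one cell while keeping the number in flight small enough not to exhaust memory.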

§8 Future Work

Three directions follow naturally from this work, in increasing order of ambition:

  1. Async parallel sub-calls. Replace the sequential llm_query() dispatch with an async pool. Bounded concurrency is enough — the GPU saturates fast — but throughput gains should be substantial on multi-call iterations.
  2. Adaptive recursion depth. Let the root model decide whether a sub-call's answer is itself worth recursing on, instead of treating the recursion depth as fixed at 1. The added control flow is small; the question is whether the 2 B model can wield it without runaway loops.
  3. RLM-aware SFT, end to end. Once the trajectory pool reaches the 500-example threshold, run the AdamW vs. Muon ablation, merge the best LoRA adapter back into the base, and re-run the long-context benchmark. The hypothesis to falsify: an RLM-aware fine-tune narrows the 16 K gap and increases the 64 K margin without changing the inference architecture.

§9 Conclusion

The Recursive Language Model is best understood not as a new model but as a new interface between a language model and a long input. The interface — a Python REPL with a context string and an llm_query() hook — moves the work of finding evidence out of attention and into deterministic code, then folds the work of reading evidence into focused sub-calls that never grow large enough to suffer from context rot. The architectural claim is hardware-independent: it costs little to verify, costs nothing to run, and turns out to do exactly what its authors said it would. A 2 B model on a 6 GB laptop GPU finds needles in 64 K-token haystacks that a same-weights vanilla baseline cannot see at all. That is the entire claim, and it reproduces.


§ References & Further Reading

  1. Zhang, A., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601. The original paper this work reproduces.
  2. Zhang, A. (2025). RLM blog post. alexzhang13.github.io/blog/2025/rlm. Author's accessible companion write-up.
  3. Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. The U-curve result.
  4. Bertsch, A. et al. (2025). OOLONG: A Long-Context Aggregation Benchmark. arXiv:2511.02817.
  5. Jordan, K. (2024). Muon: Momentum-Orthogonalised Updates for 2-D Parameters. github.com/KellerJordan/Muon.
  6. Patel, P. (2026). RLM Reproduction — implementation, benchmarks, training audit. github.com/priyank766/RLM. The repository accompanying this paper.

All code, prompts, and benchmark configurations are released under the project repository. Trajectory data and training checkpoints are available on request.

Set in Source Serif 4, Inter & JetBrains Mono. Figures hand-authored in SVG. Typeset for the web; print-friendly. Independent reproduction — no affiliation with the original authors.