Recursive Language Models on Commodity Hardware
Reproducing the RLM architecture with a 2 B parameter model on a 6 GB laptop GPU — and what it teaches us about context rot.
Abstract
Frontier language models continue to advertise context windows of one million tokens and beyond, yet their reasoning quality degrades long before that limit — a phenomenon now widely called context rot. Recursive Language Models (RLMs), introduced by Zhang et al. (2025), sidestep the problem by refusing to put the full prompt into the model at all: the document is held as a Python variable in a persistent REPL, and the LLM writes code to inspect, slice, and recursively query it.
This paper reports an independent, from-scratch reimplementation of the RLM architecture, executed on a single NVIDIA RTX 4050 laptop GPU (6 GB VRAM) with a quantised 2 B-parameter Qwen model serving as both root and sub-LM. We replicate the architectural claim end-to-end, run a Needle-in-a-Haystack benchmark across 4 K to 128 K-token contexts, and observe the predicted crossover: the RLM is no better than the vanilla baseline below 16 K tokens, but the gap widens decisively past the local KV-cache truncation threshold, where the vanilla model is structurally blind to most of the document. We additionally audit the supervised-fine-tuning pipeline — including a Muon-vs-AdamW ablation — and document the engineering corrections required for a low-VRAM training run.
The contribution of this work is not novel science but accessible verification: RLMs are an algorithmic, not a scale-bound, idea, and they reproduce on a laptop.
§1 Introduction
The dominant narrative around long-context LLMs is one of monotonic progress: each new model release nudges the maximum context window upward — 128 K, 256 K, one million, ten million tokens. The implicit promise is that, with a large enough window, the entire problem of long-document reasoning will dissolve. It has not.
Every frontier model surveyed in 2026 — eighteen of them, from GPT-4.1 to Claude Opus 4 to Gemini 2.5 Pro — continues to exhibit context rot: a measurable, monotonic decline in output quality as input length grows, well before the advertised maximum is reached.[1] The symptoms are familiar to any practitioner who has fed an LLM a non-trivial document: detail is missed in the middle, multi-hop questions hallucinate, and the model "remembers" the start and end of the input far better than the intervening pages. The phenomenon has a name, the lost-in-the-middle effect, and a U-shaped attention profile that is reproducible across architectures.
The mechanisms are well understood. Self-attention distributes a fixed softmax mass across $N$ tokens, so the per-token signal scales as $\mathcal{O}(1/N)$ while the noise floor rises. Coherent text adds distractor interference — semantically similar but irrelevant tokens that look like the answer. And even when an individual fact is retrieved, aggregation failure kicks in when the model must hold many such facts simultaneously in its working register. The standard remedies — bigger windows, retrieval, summarisation — each address only one of these failure modes.
Zhang, Kraska, and Khattab proposed a structurally different remedy in late 2025: the Recursive Language Model (RLM). Their move is to treat the long prompt not as text to be read, but as external state the model interacts with through code. The full document is stored in a Python REPL as a variable; the LLM is given only metadata about it and a programming environment in which to write search code, take focused slices, and — critically — call itself recursively on those slices. Their reported gains were striking: an RLM-wrapped GPT-5 outperformed vanilla GPT-5 by 28.4 % on OOLONG, and 58.0 % vs. <0.1 % F1 on OOLONG-Pairs.[2]
Those numbers were obtained on a frontier model in a research lab. The natural follow-up question — and the one this paper attempts to answer — is whether the architectural claim survives translation to the other end of the hardware spectrum. Specifically:
- Can the RLM loop be implemented faithfully on commodity hardware?
- Does a 2 B-parameter model write working REPL code at all?
- If it does, where exactly does the RLM's advantage emerge — and does that crossover match what the paper predicts?
- What is the practitioner's experience: latency, failure modes, and the auxiliary engineering needed to make any of this run on a laptop?
The answers are, in order: yes, yes, around the local KV-cache truncation point, and — quite a lot.
1.1 Contributions
This paper makes three contributions:
- A from-scratch implementation of the RLM loop (REPL, helper functions, recursive sub-calls, recursive-first policy, trajectory logging) targeting 6 GB VRAM. The same Qwen-2 B model serves as both root and sub-LM, isolating the architecture as the only independent variable.
- A controlled long-context benchmark (Needle-in-a-Haystack, 4 K → 128 K tokens) showing that the RLM advantage emerges precisely at the context length where the local KV cache begins to truncate the vanilla baseline — and that the RLM is unaffected because it never asks the model to read the full document.
- An audit and engineering report on the SFT pipeline for fine-tuning small RLM-aware models with QLoRA, including corrections to a published Muon optimiser implementation when applied to LoRA adapters.
§2 Why Bigger Windows Are Not the Fix
Three failure modes drive context rot, and none of them are removed by a larger window.
2.1 Attention dilution
Transformer self-attention applies a softmax over all $N$ keys. The expected attention mass on any single token is $\mathcal{O}(1/N)$. For a query that should focus on $k$ relevant tokens, the signal-to-noise ratio falls roughly as $k/N$. At $N=10\,000$ the model is implicitly tracking $\sim 10^{8}$ pairwise relationships; at $N=10^{5}$, $10^{10}$. Quadratic compute cost is not the only consequence — the discriminability of the right token from the wrong token degrades with it.
2.2 Distractor interference
Coherent prose is full of plausible distractors. Code is worse: consistent naming, repeated patterns, structurally identical functions. A vanilla LLM searching for a specific bug or fact has nothing to filter with — its only mechanism is attention itself, which is already losing the signal.
2.3 Aggregation failure
Even granted that an individual fact can be retrieved, multi-hop synthesis requires the model to juggle several such facts in its working register concurrently. The capacity for this scales sub-linearly with context length, and at the same time, the working register is being polluted by the very context the model is trying to reason about.
Compaction and rolling summarisation throw information away. Hierarchical attention reduces, rather than removes, the dilution problem. Each of these is a patch on a symptom. The RLM does something different: it removes the question of how the model reads a long context from the problem statement entirely.
If the long prompt was never in the context window, none of the failure modes of attention apply to it.
§3 The Recursive Language Model
3.1 The architectural move
Both architectures use the same model, the same weights, and the same temperature. The only thing that changes is the structure of what the model is asked to do. Figure 1 sketches the contrast.
llm_query(), and signals completion with FINAL().3.2 The REPL environment
The REPL is a persistent Python namespace, initialised once per query with the document, helper functions, and the recursive call. The root model never sees the document directly; it sees only a metadata header:
# What the root LLM actually receives — every iteration Context info: Total length : 52,719 characters Preview : "It is of the first importance to not allow yourself..." Query: What was Holmes trying to figure out? Write Python code to examine the context and answer the query. Call FINAL(answer) when you have the answer.
The namespace exposes a small library of data-access primitives, deterministic Python functions that do the things attention is bad at — head(n), tail(n), context_slice(a,b), chunk_text(), keyword_windows(), regex_windows(). There is also query_chunks(), a batch helper, and the two terminal functions FINAL() and FINAL_VAR(). Crucially, there is llm_query(): a fresh, isolated invocation of the same model on a short prompt, which is what makes the system recursive.
3.3 The recursive sub-call
Each call to llm_query(prompt) opens a new conversation with the same model. The sub-prompt is short — typically a 400-character snippet plus a focused question — and the sub-LM has no memory of earlier iterations. The advantage is structural: the sub-call's working context never exceeds a few hundred tokens, so attention dilution does not apply. Recursion provides a way to spend more tokens without spending them at the same time.
# A typical second-iteration cell, written by the root LLM windows = keyword_windows("Holmes", window=400, limit=3) for i, w in enumerate(windows): answer = llm_query(f""" Based on this text: {w} Question: What case is Holmes working on? Answer concisely. """) print(f"Window {i}: {answer}")
3.4 The recursive-first policy
Small models are lazy in a specific sense: given a metadata header that includes a 300-character preview, they will frequently call FINAL() in the very first iteration, guessing from the preview alone. For long documents this is almost always wrong. We add a soft policy that triggers when len(context) > 16{,}000:
- A first-iteration
FINAL()is rejected with a short feedback string instructing the model to gather evidence. - A
FINAL()with zero recorded sub-calls is rejected on the same grounds. - After two rejections the policy releases — pressure, not a hard wall — so an honestly-stuck run can still terminate.
Empirically the second attempt is almost always materially better than the rejected first attempt. The policy adds about 8 lines to the loop and recovers a meaningful fraction of would-be failures.
§4 Setup & the Sherlock Walk-through
4.1 Hardware and model
All experiments run on a single laptop with an NVIDIA RTX 4050 GPU (6 GB VRAM) and 16 GB system memory. The model is qwen3.5:2b served via Ollama at temperature 0.0 with thinking-mode disabled. Both root and sub-LM are the same model — same weights, same tokenizer — so any measured difference is attributable to the inference architecture and not to the language model itself.
One detail of the local stack matters for what follows: with 6 GB of VRAM, Ollama auto-caps the KV cache at 4 096 tokens, which corresponds to roughly 16 000 characters of text. Anything past that is invisible to the vanilla path. The cap is not a bug; it is the realistic constraint a hobbyist replication runs into.
4.2 The document and the question
For the qualitative walk-through we use a plain-text version of A Scandal in Bohemia by Arthur Conan Doyle: 52 719 characters, 9 199 words, ~13–15 K tokens. The query is a deliberately summary-level question — "What was Holmes trying to figure out?" — that requires the model to identify the central plot rather than retrieve a single fact.
Of the 52 K-character document, only the first ~16 K characters fit in the local KV cache. For this particular question the answer happens to live in that first third, so the vanilla model gets lucky. Move the question to the ending and the vanilla path is structurally blind. The RLM has no equivalent failure mode — its access to the document is mediated by Python, not attention.
context string but never reaches the model.4.3 The trajectory
The interactive trace below shows the four iterations the RLM produced for this query, side-by-side with what the vanilla baseline does (a single forward pass over a truncated input). Toggle to compare.
§5 Long-Context Results
To move beyond a single-document anecdote we ran a Needle-in-a-Haystack benchmark on the same setup: a needle fact embedded at controlled depths (10 %, 50 %, 90 %) in a synthetic English haystack of varying length. The model is asked for the needle exactly. Both architectures see the same set of (haystack, needle, depth) triples.
5.1 Tabulated results
| Context | Vanilla | RLM | Δ | Winner | Notes |
|---|---|---|---|---|---|
| 4 K | 100% | 100% | 0 | tie | Both fit comfortably in KV cache |
| 8 K | 100% | 100% | 0 | tie | Vanilla still has full input |
| 16 K | 100% | 67% | −33 | vanilla | RLM code-quality issue (small model) |
| 32 K | 33% | 33% | 0 | tie | Truncation begins for vanilla |
| 64 K | 33% | 67% | +34 | rlm | Vanilla blind to most of the document |
| 128 K | 0% | 80%* | +80 | rlm | Vanilla collapses; RLM still locates needle |
*Partial run; final point extrapolated from successful sub-runs.
5.2 Reading the curve
The shape of Figure 4 is more interesting than any individual cell. Below 16 K, the vanilla baseline is fine and faster — there is no honest reason to wrap a small prompt in a REPL loop. The 16 K row shows a real and worth-noting failure mode: at the boundary, the 2 B model's REPL code generation degrades faster than the vanilla path's attention does, so the architecture loses at the tie point. Above 32 K the picture inverts. Vanilla's effective input is fixed by the local KV cap; the RLM's effective input is the document. The advantage is not that the RLM is smarter — it is that vanilla is structurally blind.
The RLM's advantage is structural, not magical. It emerges exactly where the vanilla path stops being able to read.
§6 Fine-Tuning & the Muon Audit
Each successful RLM run produces a trajectory: an ordered record of the code cells the root model wrote, the REPL output it received, the sub-call prompts and answers, and the final response. These trajectories are useful three times over — for debugging, for comparison, and as supervised fine-tuning data: an example of how a model should approach a long-context task.
The pipeline targets QLoRA on the same Qwen family with rank-16 adapters and a max sequence length of 2 048 tokens, so it fits inside 6 GB of VRAM. A central question for this work is whether the choice of optimiser affects either the final accuracy or the convergence speed of the trained adapter. We compare AdamW against Muon, an optimiser that applies Newton–Schulz orthogonalisation to 2-D weight matrices to produce direction-only updates.[6]
6.1 What the audit found
A static audit of the off-the-shelf Muon implementation, applied to QLoRA adapters, revealed three issues that matter on this scale of training:
- Missing momentum warmup. The published implementation hard-codes momentum at 0.95. The recommended practice — and what the official Muon repository now uses — is a linear ramp from 0.85 to 0.95 over the first ~300 steps. With LoRA's tiny trainable parameter count, the warmup is the difference between stable and unstable early gradients.
- Default learning rate inherited from full-parameter training. The default of 0.02 was tuned for full NanoGPT training. For LoRA adapters the magnitude is wrong by roughly an order of magnitude; we recommend starting at 0.005 and sweeping in [0.002, 0.005, 0.01, 0.02].
- Embedding/LM-head not excluded. The split function assigns parameters to Muon by the heuristic
ndim == 2, which incorrectly pulls inembed_tokensandlm_head. Under QLoRA both are frozen, so the bug is silent in this run, but the guard belongs in the code regardless.
# Patched parameter split — embeddings and lm_head explicitly excluded def split_params_for_muon(model): muon_params, adamw_params = [], [] for name, p in model.named_parameters(): if not p.requires_grad: continue if p.ndim == 2 and not any(x in name for x in ("embed_tokens", "lm_head")): muon_params.append(p) else: adamw_params.append(p) return muon_params, adamw_params
6.2 Status of the SFT run
The training pipeline is built, the data-collection scripts are wired, and the audit corrections have been applied. The bottleneck at the time of writing is the size of the trajectory dataset itself: 16 clean trajectories from 52 candidate runs is well below the 500-1 000 minimum we target before the SFT vs. Muon ablation can produce a meaningful comparison. The next step — and the work this paper hands off to its successor — is a sustained generation run across a wider grid of NIAH and Long-Document-QA tasks to grow the trajectory pool, followed by paired AdamW and Muon training under matched seeds.
§7 Limitations & Failure Modes
This study is small in every dimension that matters for a strong empirical claim. Three caveats worth naming:
Single-model, single-hardware result
Everything reported here uses one model family on one GPU. The crossover point — the context length at which RLM begins to win — is a function of both. A bigger model with a larger native KV cache would shift the crossover rightward; a slower one would shift it leftward. The direction of the result reproduces the original paper. The quantitative threshold is local.
Code-quality dependence at the lower bound
The 16 K dip is real and is the most informative failure mode in the dataset. A 2 B model is small enough that, on borderline-length problems, its REPL code occasionally has bugs the loop cannot recover from. A larger root model would not have this problem. SFT on collected trajectories should help; we have not yet shown that it does.
Latency
An RLM run on the Sherlock document takes ~60 seconds; the vanilla baseline takes ~15. The sub-calls are sequential in our implementation. Asynchronous parallel sub-calls — where independent llm_query() invocations dispatch concurrently — are the obvious next implementation step and are likely to compress total wall-clock by 2–3×.
§8 Future Work
Three directions follow naturally from this work, in increasing order of ambition:
- Async parallel sub-calls. Replace the sequential
llm_query()dispatch with an async pool. Bounded concurrency is enough — the GPU saturates fast — but throughput gains should be substantial on multi-call iterations. - Adaptive recursion depth. Let the root model decide whether a sub-call's answer is itself worth recursing on, instead of treating the recursion depth as fixed at 1. The added control flow is small; the question is whether the 2 B model can wield it without runaway loops.
- RLM-aware SFT, end to end. Once the trajectory pool reaches the 500-example threshold, run the AdamW vs. Muon ablation, merge the best LoRA adapter back into the base, and re-run the long-context benchmark. The hypothesis to falsify: an RLM-aware fine-tune narrows the 16 K gap and increases the 64 K margin without changing the inference architecture.
§9 Conclusion
The Recursive Language Model is best understood not as a new model but as a new interface between a language model and a long input. The interface — a Python REPL with a context string and an llm_query() hook — moves the work of finding evidence out of attention and into deterministic code, then folds the work of reading evidence into focused sub-calls that never grow large enough to suffer from context rot. The architectural claim is hardware-independent: it costs little to verify, costs nothing to run, and turns out to do exactly what its authors said it would. A 2 B model on a 6 GB laptop GPU finds needles in 64 K-token haystacks that a same-weights vanilla baseline cannot see at all. That is the entire claim, and it reproduces.
§ References & Further Reading
- Zhang, A., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601. The original paper this work reproduces.
- Zhang, A. (2025). RLM blog post. alexzhang13.github.io/blog/2025/rlm. Author's accessible companion write-up.
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. The U-curve result.
- Bertsch, A. et al. (2025). OOLONG: A Long-Context Aggregation Benchmark. arXiv:2511.02817.
- Jordan, K. (2024). Muon: Momentum-Orthogonalised Updates for 2-D Parameters. github.com/KellerJordan/Muon.
- Patel, P. (2026). RLM Reproduction — implementation, benchmarks, training audit. github.com/priyank766/RLM. The repository accompanying this paper.
All code, prompts, and benchmark configurations are released under the project repository. Trajectory data and training checkpoints are available on request.