Handling extremely long documents remains a persistent challenge for large language models (LLMs). Even with techniques such as length extrapolation and sparse attention, models often suffer from performance degradation and high computational costs. To address this, researchers from ByteDance Seed and Tsinghua University introduce MemAgent, a reinforcement learning-based memory agent designed to enable long-context processing with linear complexity and minimal performance loss.
Limitations of Existing Approaches
Current solutions for long-context modeling fall into three main categories:
- Length Extrapolation Methods (e.g., NTK, PI, YaRN, DCA): Extend the context window via positional embedding manipulations. However, they often face performance degradation and scaling issues.
- Sparse and Linear Attention Mechanisms: Reduce attention complexity to O(n) but typically require retraining from scratch and rely on fixed patterns or human-defined rules.
- Context Compression: Use token-level or external memory modules to condense long inputs but often disrupt standard generation and struggle with extrapolation.
None of these approaches delivers all three critical attributes at once: support for arbitrary input lengths, consistent accuracy, and efficient linear complexity.
MemAgent: Human-Like Memory Strategy
Inspired by how humans summarize key information while ignoring noise, MemAgent processes the input as a stream of evidence. At each step it reads a document chunk together with its internal memory and overwrites that memory with an updated, compressed version of the context (see the sketch after the list below).
Key innovations:
- Fixed-Length Token-Based Memory: Compresses essential information while maintaining model compatibility.
- Segment-Wise Overwrite Mechanism: Supports arbitrarily long inputs without growing the memory.
- Linear Complexity: Memory-update and decoding costs stay constant per chunk, so total cost scales linearly with input length.
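A minimal sketch of this read-and-overwrite loop, assuming a hypothetical `llm_generate(prompt, max_tokens)` wrapper around any chat-style LLM call; the prompt wording, chunk size, and memory budget are illustrative, not the paper's exact implementation:

```python
# Sketch of MemAgent-style chunked reading with a fixed-length overwrite memory.
# `llm_generate(prompt, max_tokens)` is a hypothetical wrapper around any chat LLM call.

def answer_long_document(question, document, llm_generate,
                         chunk_tokens=4096, memory_tokens=1024):
    # Split the document into fixed-size chunks (token-based splitting in practice;
    # whitespace splitting here keeps the sketch self-contained).
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_tokens])
              for i in range(0, len(words), chunk_tokens)]

    memory = ""  # fixed-length memory, fully overwritten at every step
    for chunk in chunks:
        prompt = (
            f"Question: {question}\n"
            f"Current memory:\n{memory}\n"
            f"New evidence:\n{chunk}\n"
            "Rewrite the memory, keeping only information useful for the question."
        )
        # The new memory replaces the old one, so memory never grows with input length.
        memory = llm_generate(prompt, max_tokens=memory_tokens)

    # The final answer is produced from the compressed memory alone.
    final_prompt = (
        f"Question: {question}\n"
        f"Memory:\n{memory}\n"
        "Answer the question using the memory."
    )
    return llm_generate(final_prompt, max_tokens=256)
```

Because each call sees only one chunk plus the fixed memory, the per-step context never exceeds `chunk_tokens + memory_tokens`, regardless of how long the document is.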

Multi-Conv RL Training with GRPO
MemAgent treats each interaction with a document chunk as an independent dialogue. It is trained via Group Relative Policy Optimization (GRPO) within a multi-conversation extension of the DAPO pipeline, enabling reward-driven memory updates.
Key elements include:
- Rule-Based Verifier: Computes the outcome reward by comparing the model's final answer against multiple acceptable ground truths.
- Token-Level RL Signal: The resulting reward is applied uniformly across all conversations stemming from the same sample.
This setup encourages the memory to compress answer-relevant information and discard distractors; a rough sketch of the reward broadcast follows.
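A sketch of how an outcome reward can be broadcast across the conversations of one sample and normalized GRPO-style within a rollout group; the verifier rule, data layout, and function names are illustrative assumptions, not the authors' training code:

```python
import re

def verifier_reward(model_answer, ground_truths):
    # Rule-based outcome reward: 1.0 if the answer matches any accepted ground truth.
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if any(norm(gt) in norm(model_answer) for gt in ground_truths) else 0.0

def grpo_advantages(group_rewards):
    # GRPO-style group-relative advantage: reward minus group mean, scaled by group std.
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 or 1.0
    return [(r - mean) / std for r in group_rewards]

def assign_token_advantages(group_rollouts, ground_truths):
    # Each rollout is a list of conversations (one per chunk, plus a final answer turn);
    # the final turn is assumed to carry the model's answer under the "answer" key.
    rewards = [verifier_reward(rollout[-1]["answer"], ground_truths)
               for rollout in group_rollouts]
    advantages = grpo_advantages(rewards)
    # The same advantage is applied uniformly to every generated token of every
    # conversation stemming from the same sample.
    for rollout, adv in zip(group_rollouts, advantages):
        for conversation in rollout:
            conversation["token_advantage"] = adv
    return group_rollouts
```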
Performance Evaluation
Using the RULER benchmark and synthetic datasets from HotpotQA and SQuAD, MemAgent was trained with an 8K context window and extrapolated up to 3.5 million tokens.
| Model | 224K | 896K | 3.5M |
|---|---|---|---|
| Qwen2.5-Instruct-14B-1M | 37.5% | 0.0% | N/A |
| QwenLong-L1-32B | 17.2% | 11.7% | N/A |
| RL-MemAgent-14B | 81.3% | 77.3% | 78.1% |
MemAgent maintained over 95% accuracy on RULER benchmarks (8K to 512K tokens) and consistently outperformed long-context and distillation-based baselines.

Case Study: Multi-Hop QA
Given the query “The director of the romantic comedy ‘Big Stone Gap’ is based in what New York city?”, MemAgent progressively tracked relevant content across 3 chunks:
- Recognized unrelated content but retained location information.
- Maintained memory against irrelevant chunks.
- Correctly updated memory upon encountering Adriana Trigiani’s biography.
Final answer: Greenwich Village, New York City.
Theoretical Foundation and Complexity
MemAgent reformulates autoregressive generation with latent memory variables m₁ … m_K, where the input x₁:N is split into chunks c₁ … c_K:
p(x₁:N) = Σ_{m₁:K} Π_{k=1:K} p(cₖ | mₖ₋₁) · p(mₖ | cₖ, mₖ₋₁)
This yields O(N) compute cost and human-readable intermediate memory, in contrast to attention-based feature compression. RL is essential because memory updates are discrete token sequences that cannot be learned through backpropagation.
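A back-of-the-envelope comparison of the two cost regimes, using illustrative chunk and memory sizes rather than measured figures: full self-attention over N tokens scales with N², while MemAgent processes N/C fixed-size windows of C + M tokens each.

```python
def full_attention_cost(n_tokens):
    # Pairwise attention over the whole input: O(N^2) token interactions.
    return n_tokens ** 2

def memagent_cost(n_tokens, chunk=4096, memory=1024):
    # One fixed-size window (chunk + memory) per chunk: constant work per window,
    # so total cost grows linearly with N.
    n_windows = -(-n_tokens // chunk)   # ceiling division
    window = chunk + memory
    return n_windows * window ** 2

# Ratio of full-attention cost to MemAgent cost grows linearly with input length.
for n in (8_000, 896_000, 3_500_000):
    print(n, round(full_attention_cost(n) / memagent_cost(n), 1))
```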
Conclusion
MemAgent offers a scalable and efficient solution to the long-context trilemma: unlimited input length, near-lossless accuracy, and linear complexity. Its RL-based overwrite memory mechanism allows LLMs to read, abstract, and generate over multi-million-token inputs without architectural modification.
FAQs
Q1: What is MemAgent?
MemAgent is a reinforcement learning-based framework that equips LLMs with a fixed-length, overwriteable token memory so they can handle extremely long contexts efficiently.
Q2: How is it different from attention or extrapolation methods?
Unlike attention-based scaling or extrapolation techniques, MemAgent uses token-based memory updated via reinforcement learning.
Q3: What models can MemAgent be applied to?
Any Transformer-based LLM. No changes to the model architecture are required.
Q4: How does it scale with input size?
It maintains linear computational complexity regardless of input length by fixing the memory size.
Q5: What are the applications of MemAgent?
Long-document QA, agent memory systems, legal document review, scientific literature analysis, and real-time decision-making with large evidence bases.
Check out the Paper. All credit for this research goes to the researchers of this project.