How xMemory cuts token costs and context bloat in AI agents

Standard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows.xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, solves this by organizing conversations into a searchable hierarchy of semantic themes.Experiments show that xMemory improves answer quality and long-range reasoning across various LLMs while cutting inference costs. According to the researchers, it drops token usage from over 9,000 to roughly 4,700 tokens per query compared to existing systems on some tasks.For real-world enterprise applications like personalized AI assistants and multi-session decision support tools, this means organizations can deploy more reliable, context-aware agents capable of maintaining coherent long-term memory without blowing up computational expenses.RAG wasn't built for thisIn many enterprise LLM applications, a cri

Mar 25, 2026 0 0

How xMemory cuts token costs and context bloat in AI agents

Standard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows.

xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, solves this by organizing conversations into a searchable hierarchy of semantic themes.

Experiments show that xMemory improves answer quality and long-range reasoning across various LLMs while cutting inference costs. According to the researchers, it drops token usage from over 9,000 to roughly 4,700 tokens per query compared to existing systems on some tasks.

For real-world enterprise applications like personalized AI assistants and multi-session decision support tools, this means organizations can deploy more reliable, context-aware agents capable of maintaining coherent long-term memory without blowing up computational expenses.

RAG wasn't built for this

In many enterprise LLM applications, a critical expectation is that these systems will maintain coherence and personalization across long, multi-session interactions. To support this long-term reasoning, one common approach is to use standard RAG: store past dialogues and events, retrieve a fixed number of top matches based on embedding similarity, and concatenate them into a context window to generate answers.

However, traditional RAG is built for large databases where the retrieved documents are highly diverse. The main challenge is filtering out entirely irrelevant information. An AI agent's memory, by contrast, is a bounded and continuous stream of conversation, meaning the stored data chunks are highly correlated and frequently contain near-duplicates.

To understand why simply increasing the context window doesn’t work, consider how standard RAG handles a concept like citrus fruit.

Imagine a user has had many conversations saying things like “I love oranges,” “I like mandarins,” and separately, other conversations about what counts as a citrus fruit. Traditional RAG may treat all of these as semantically close and keep retrieving similar “citrus-like” snippets.

“If retrieval collapses onto whichever cluster is densest in embedding space, the agent may get many highly similar passages about preference, while missing the category facts needed to answer the actual query,” Lin Gui, co-author of the paper, told VentureBeat.

A common fix for engineering teams is to apply post-retrieval pruning or compression to filter out the noise. These methods assume that the retrieved passages are highly diverse and that irrelevant noise patterns can be cleanly separated from useful facts.

This approach falls short in conversational agent memory because human dialogue is “temporally entangled,” the researchers write. Conversational memory relies heavily on co-references, ellipsis, and strict timeline dependencies. Because of this interconnectedness, traditional pruning tools often accidentally delete important bits of a conversation, leaving the AI without vital context needed to reason accurately.

Why the fix most teams reach for makes things worse

To overcome these limitations, the researchers propose a shift in how agent memory is built and searched, which they describe as “decoupling to aggregation.”

Instead of matching user queries directly against raw, overlapping chat logs, the system organizes the conversation into a hierarchical structure. First it decouples the conversation stream into distinct, standalone semantic components. These individual facts are then aggregated into a higher-level structural hierarchy of themes.

When the AI needs to recall information, it searches top-down through the hierarchy, going from themes to semantics and finally to raw snippets. This approach avoids redundancy. If two dialogue snippets have similar embeddings, the system is unlikely to retrieve them together if they have been assigned to different semantic components.

For this architecture to succeed, it must balance two vital structural properties. The semantic components must be sufficiently differentiated to prevent the AI from retrieving redundant data. At the same time, the higher-level aggregations must remain semantically faithful to the original context to ensure the model can craft accurate answers.

A four-level hierarchy that shrinks the context window

The researchers developed xMemory, a framework that combines structured memory management with an adaptive, top-down search strategy.

xMemory continuously organizes the raw stream of conversation into a structured, four-level hierarchy. At the base are the raw messages, which are first summarized into contiguous blocks called “episodes.” From these episodes, the system distills reusable facts as semantics that disentangle the core, long-term knowledge from repetitive chat logs. Finally, related semantics are grouped together into high-level themes to make them easily searchable.

xMemory uses a special objective function to constantly optimize how it groups these items. This prevents categories from becoming too bloated, which slows down search, or too fragmented, which weakens the model’s ability to aggregate evidence and answer questions.

When it receives a prompt, xMemory performs a top-down retrieval across this hierarchy. It starts at the theme and semantic levels, selecting a diverse, compact set of relevant facts. This is crucial for real-world applications where user queries often require gathering descriptions across multiple topics or chaining connected facts together for complex, multi-hop reasoning.

Once it has this high-level skeleton of facts, the system controls redundancy through what the researchers call "Uncertainty Gating." It only drills down to pull the finer, raw evidence at the episode or message level if that specific detail measurably decreases the model’s uncertainty.

“Semantic similarity is a candidate-generation signal; uncertainty is a decision signal,” Gui said. “Similarity tells you what is nearby. Uncertainty tells you what is actually worth paying for in the prompt budget.” It stops expanding when it detects that adding more detail no longer helps answer the question.

What are the alternatives?

Existing agent memory systems generally fall into two structural categories: flat designs and structured designs. Both suffer from fundamental limitations.

Flat approaches such as MemGPT log raw dialogue or minimally processed traces. This captures the conversation but accumulates massive redundancy and increases retrieval costs as the history grows longer.

Structured systems such as A-MEM and MemoryOS try to solve this by organizing memories into hierarchies or graphs. However, they still rely on raw or minimally processed text as their primary retrieval unit, often pulling in extensive, bloated contexts. These systems also depend heavily on LLM-generated memory records that have strict schema constraints. If the AI deviates slightly in its formatting, it can cause memory failure.

xMemory addresses these limitations through its optimized memory construction scheme, hierarchical retrieval, and dynamic restructuring of its memory as it grows larger.

When to use xMemory

For enterprise architects, knowing when to adopt this architecture over standard RAG is critical. According to Gui, “xMemory is most compelling where the system needs to stay coherent across weeks or months of interaction.”

Customer support agents, for instance, benefit greatly from this approach because they must remember stable user preferences, past incidents, and account-specific context without repeatedly pulling up near-duplicate support tickets. Personalized coaching is another ideal use case, requiring the AI to separate enduring user traits from episodic, day-to-day details.

Conversely, if an enterprise is building an AI to chat with a repository of files, such as policy manuals or technical documentation, “a simpler RAG stack is still the better engineering choice,” Gui said. In those static, document-centric scenarios, the corpus is diverse enough that standard nearest-neighbor retrieval works perfectly well without the operational overhead of hierarchical memory.

The write tax is worth it

xMemory cuts the latency bottleneck associated with the LLM's final answer generation. In standard RAG systems, the LLM is forced to read and process a bloated context window full of redundant dialogue. Because xMemory's precise, top-down retrieval builds a much smaller, highly targeted context window, the reader LLM spends far less compute time analyzing the prompt and generating the final output.

In their experiments on long-context tasks, both open and closed models equipped with xMemory outperformed other baselines, using considerably fewer tokens while increasing task accuracy.

However, this efficient retrieval comes with an upfront cost. For an enterprise deployment, the catch with xMemory is that it trades a massive read tax for an upfront write tax. While it ultimately makes answering user queries faster and cheaper, maintaining its sophisticated architecture requires substantial background processing.

Unlike standard RAG pipelines, which cheaply dump raw text embeddings into a database, xMemory must execute multiple auxiliary LLM calls to detect conversation boundaries, summarize episodes, extract long-term semantic facts, and synthesize overarching themes.

Furthermore, xMemory’s restructuring process adds additional computational requirements as the AI must curate, link, and update its own internal filing system. To manage this operational complexity in production, teams can execute this heavy restructuring asynchronously or in micro-batches rather than synchronously blocking the user's query.

For developers eager to prototype, the xMemory code is publicly available on GitHub under an MIT license, making it viable for commercial uses. If you are trying to implement this in existing orchestration tools like LangChain, Gui advises focusing on the core innovation first: “The most important thing to build first is not a fancier retriever prompt. It is the memory decomposition layer. If you get only one thing right first, make it the indexing and decomposition logic.”

Retrieval isn't the last bottleneck

While xMemory offers a powerful solution to today's context-window limitations, it clears the path for the next generation of challenges in agentic workflows. As AI agents collaborate over longer horizons, simply finding the right information won't be enough.

“Retrieval is a bottleneck, but once retrieval improves, these systems quickly run into lifecycle management and memory governance as the next bottlenecks,” Gui said. Navigating how data should decay, handling user privacy, and maintaining shared memory across multiple agents is exactly “where I expect a lot of the next wave of work to happen,” he said.