
LLM Context Management: How to Improve Performance and Lower Costs

Posted on August 25, 2025 by Zhu Liang

Modern Large Language Models (LLMs) feature increasingly large context windows, with some models like Gemini 2.5 Pro capable of processing up to 1 million tokens.

While this seems like an advantage, simply filling the context window with as much information as possible is actually a bad practice. This creates context bloat, which can lead to worse performance and higher costs.

More recently, the practice of managing LLM context has come to be called "context engineering." Andrej Karpathy (ex-OpenAI researcher) calls it "the delicate art and science of filling the context window with just the right information."

Tweet by Andrej Karpathy, ex-OpenAI researcher, on context engineering

Performance Degrades with More Context

It is important to understand that as the context window grows, the model's performance starts to degrade.

A study on long-context evaluation called NoLiMa found that for many popular LLMs, "performance degrades significantly as context length increases." At 32k tokens, 11 out of 12 tested models dropped below 50% of their short-context performance.

NoLiMa benchmark results show performance degradation for popular models at longer context lengths.

This happens because the attention mechanism can struggle to find the most relevant information within a sea of text.

More recent results from the Fiction.liveBench benchmark also show this trend. As context size grows, even top models see a drop in their ability to recall and reason about the information provided.

Fiction.liveBench results show performance degradation for popular models at longer context lengths.

Here's an excerpt of the Fiction.liveBench results for popular new models:

| Model | Performance at 32k | Performance at 120k | Performance at 192k |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | 91.7 | 87.5 | 90.6 |
| GPT-5 | 97.2 | 96.9 | 87.5 |
| DeepSeek v3.1 | 50.0 | 53.1 | - |
| Claude Sonnet 4 | 44.4 | 36.4 | - |

The strongest long-context models, such as Gemini 2.5 Pro and GPT-5, still see their scores drop to around 90% at 192k tokens, while models like DeepSeek v3.1 and Claude Sonnet 4 fare much worse even at 32k.

For the best results, you should aim to operate within a model's effective context length, which is the point before its performance starts to drop significantly. For most new top models, this is around 120k tokens to 200k tokens.
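
As a rough guard, you can estimate a prompt's token count before sending it and compare it against an assumed effective-length budget. The sketch below is a minimal illustration: the budget figures and the ~4-characters-per-token heuristic are assumptions based on the benchmark excerpt above, not official numbers, and real tokenizers vary by model.

```python
# Minimal sketch: guard against exceeding an assumed effective context length.
# The budgets and the ~4 chars-per-token heuristic are illustrative assumptions.

EFFECTIVE_CONTEXT_TOKENS = {
    "gemini-2.5-pro": 192_000,  # assumption based on the benchmark excerpt above
    "gpt-5": 120_000,           # assumption: scores dip past ~120k in the table above
}

def estimate_tokens(text: str) -> int:
    """Crude estimate; real tokenizers differ per model."""
    return len(text) // 4

def within_effective_context(model: str, prompt: str) -> bool:
    budget = EFFECTIVE_CONTEXT_TOKENS.get(model)
    return budget is None or estimate_tokens(prompt) <= budget

# Example: warn before sending an oversized prompt
prompt = "relevant code and instructions..." * 20_000
if not within_effective_context("gpt-5", prompt):
    print("Prompt likely exceeds the effective context length; trim it first.")
```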

The Increasing Cost of Large Context

Besides performance issues, overloading the context also increases costs. LLMs are stateless, which means they do not have memory of past conversations. For every message you send, the entire conversation history must be sent back to the model.
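
Here is a minimal sketch of that mechanic, assuming a generic chat-style interface. The `call_model` function below is a hypothetical stub, not any specific SDK:

```python
# Because the model is stateless, every call must include the full history so far.

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def call_model(history: list[dict]) -> str:
    # hypothetical stub standing in for a real chat-completion API call
    return f"(reply after reading {len(history)} messages)"

def send(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    reply = call_model(messages)  # the ENTIRE conversation is sent as input, every time
    messages.append({"role": "assistant", "content": reply})
    return reply

# Turn 1 sends 2 messages, turn 2 sends 4, turn 3 sends 6, and so on:
# the input grows with every turn even though each new question is short.
send("Summarize this file.")
send("Now refactor the summary into bullet points.")
```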

Here's a nice illustration by OpenAI of how the context window grows with each turn in a conversation:

An illustration of how the context window grows with each turn in a conversation. Source: OpenAI

As the conversation gets longer, the number of tokens sent as input to the model grows for each new message.

Since API usage is priced per token, longer contexts directly translate to higher costs. Keeping the context concise and relevant is therefore not just good for performance, but also for your budget.
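
To see how quickly this adds up, here is a back-of-the-envelope calculation. Both numbers below are illustrative assumptions (roughly 500 tokens added per turn, input billed at $3 per million tokens), not real pricing for any particular model.

```python
# Illustrative cost of re-sending a growing history; both constants are assumptions.
TOKENS_ADDED_PER_TURN = 500
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

context_size = 0
total_input_tokens = 0
for turn in range(1, 21):
    context_size += TOKENS_ADDED_PER_TURN   # the history grows each turn...
    total_input_tokens += context_size      # ...and is re-sent in full as input

print(f"Context after 20 turns: {context_size:,} tokens")    # 10,000
print(f"Total input tokens billed: {total_input_tokens:,}")  # 105,000
print(f"Approx. input cost: ${total_input_tokens / 1e6 * PRICE_PER_MILLION_INPUT_TOKENS:.2f}")
```

Even in this small example, the total input tokens billed (105k) end up an order of magnitude larger than the final context itself (10k), because every turn re-pays for everything that came before.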

What Causes Context Bloat

Context can get filled up for several reasons, often without the user realizing it. For coding tasks, one common cause of context bloat is including irrelevant rules or instructions. Including instructions for a backend task while working on the frontend can confuse the model, as explained by a developer from Cline.

Having too many tools from MCP servers can also quickly increase context size. For example, the popular Playwright MCP server provides 21 tools, each with its own definition and instructions.

With the help of the newly released `/context` command in Claude Code, we can see that a single Playwright MCP server can consume over 11.7k tokens, while the built-in tools consume 11.6k tokens.

The `/context` command in Claude Code shows that Playwright MCP tools take up 11.7k tokens from the context.
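
To get an intuition for why tool definitions add up so quickly, the sketch below sums a rough token estimate over a list of tool schemas. The schemas are made-up stand-ins, not the actual Playwright MCP definitions, and the ~4-characters-per-token estimate is a crude approximation.

```python
import json

# Made-up tool schemas standing in for real MCP tool definitions.
tools = [
    {
        "name": "browser_navigate",
        "description": "Navigate the browser to a given URL.",
        "parameters": {"url": {"type": "string", "description": "Destination URL"}},
    },
    {
        "name": "browser_click",
        "description": "Click the element matching a selector on the current page.",
        "parameters": {"selector": {"type": "string", "description": "CSS selector"}},
    },
    # ...a real server may expose 20+ tools like these
]

def estimated_tokens(obj) -> int:
    return len(json.dumps(obj)) // 4  # crude ~4 chars-per-token heuristic

total = sum(estimated_tokens(t) for t in tools)
print(f"{len(tools)} tool definitions ≈ {total} tokens of context before any work starts")
```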

Another simple cause of context bloat is reusing the same chat session for multiple, unrelated tasks. This clutters the conversation history with irrelevant information from previous tasks. The model must then sift through this old context, which can negatively affect its focus on the current problem.

How to Manage Context Effectively

The key to managing context is understanding what is in it. In tools like Claude Code, you can use the `/context` command to see a breakdown of the current token usage of your context window.

If you see memory files taking up too many tokens, consider removing irrelevant rules from your CLAUDE.md or AGENTS.md file. Keep it concise and relevant.

Some tools, like Claude Code, support directory-level rules or memory files. For example, if your repository keeps frontend code in a frontend directory, you can place a CLAUDE.md with frontend-specific instructions inside that directory. It will then only be loaded when the model works on frontend tasks in that directory.

If you see tools from MCP servers taking up too many tokens, be selective about which MCP servers you enable. If an MCP server is not needed for your current task, consider disabling it to free up context space.

Example of good context management in Claude Code, using only 17k tokens at the start of the conversation.

Another important tip is to start a new session for each new task, clearing the current context window. This ensures the context contains only relevant information, preventing the model from getting confused by previous, unrelated conversations.


Tools like 16x Eval are designed to help developers and researchers run systematic evaluations of LLMs. You can use it to test how different models perform with different prompts or varying context lengths.

Screenshot of a sample evaluation from 16x Eval

This allows you to understand a model's effective context length and refine your context engineering strategy for your specific needs.
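
If you want a feel for what such a test looks like, here is a minimal needle-in-a-haystack-style sketch that builds prompts of different approximate lengths and checks whether a planted fact is recalled. The `run_model` function is a hypothetical stub for whichever model or API you are evaluating, and the ~4-characters-per-token estimate is an assumption.

```python
# Minimal varying-context-length recall test (needle-in-a-haystack style).
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret project codename is BLUE-HARBOR."
QUESTION = "What is the secret project codename?"

def build_prompt(approx_tokens: int) -> str:
    repeats = max(1, (approx_tokens * 4) // len(FILLER))  # ~4 chars per token
    haystack = FILLER * repeats
    midpoint = len(haystack) // 2
    return haystack[:midpoint] + NEEDLE + " " + haystack[midpoint:] + "\n\n" + QUESTION

def run_model(prompt: str) -> str:
    # hypothetical stub: replace with a real call to the model under test
    return "The codename is BLUE-HARBOR."

for length in (8_000, 32_000, 120_000):
    answer = run_model(build_prompt(length))
    print(f"{length:>7} tokens -> recalled: {'BLUE-HARBOR' in answer}")
```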


Download 16x Eval

No sign-up or login required. Create your own evals in minutes.