Modern Large Language Models (LLMs) feature increasingly large context windows, with some models like Gemini 2.5 Pro capable of processing up to 1 million tokens.
While this seems like an advantage, simply filling the context window with as much information as possible is actually a bad practice. This creates context bloat, which can lead to worse performance and higher costs.
More recently, the practice of managing what goes into an LLM's context has come to be known as context engineering. Andrej Karpathy (ex-OpenAI researcher) calls it "the delicate art and science of filling the context window with just the right information."
Performance Degrades with More Context
It is important to understand that as the amount of content in the context window grows, a model's performance starts to degrade.
A study on long-context evaluation called NoLiMa found that for many popular LLMs, "performance degrades significantly as context length increases." The NoLiMa benchmark showed that at 32k tokens, 11 out of 12 tested models dropped below 50% of their performance in short contexts.
This happens because the attention mechanism can struggle to find the most relevant information within a sea of text.
More recent results from the Fiction.liveBench benchmark also show this trend. As context size grows, even top models see a drop in their ability to recall and reason about the information provided.
Here's an excerpt from the Fiction.liveBench results for popular new models:
| Model | Performance at 32k | Performance at 120k | Performance at 192k |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | 91.7 | 87.5 | 90.6 |
| GPT-5 | 97.2 | 96.9 | 87.5 |
| DeepSeek v3.1 | 50.0 | 53.1 | - |
| Claude Sonnet 4 | 44.4 | 36.4 | - |
The strongest models for long-context tasks, Gemini 2.5 Pro and GPT-5, see their performance drop to around 90% at 192k tokens. Other models like DeepSeek v3.1 and Claude Sonnet 4 fare much worse, scoring around or below 50% even at 32k tokens.
For the best results, you should aim to operate within a model's effective context length, which is the point before its performance starts to drop significantly. For most new top models, this is around 120k tokens to 200k tokens.
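If you want to find this threshold for a particular model yourself, a simple recall probe can give a rough signal. The sketch below is a minimal, illustrative version of that idea using the OpenAI Python SDK; the model name, filler text, and single "needle" question are placeholder assumptions, and proper benchmarks like NoLiMa or Fiction.liveBench test recall and reasoning far more rigorously.

```python
# Minimal sketch of probing a model's effective context length: hide a known
# fact ("needle") inside filler text of increasing size and check whether the
# model can still recall it. The model name and filler are placeholders; real
# benchmarks use far more careful question design than this.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"  # placeholder model name

NEEDLE = "The access code for the vault is 7481."
QUESTION = "What is the access code for the vault? Reply with the number only."
FILLER = "The weather report for the region was unremarkable that day. "

def recalls_needle(context_chars: int) -> bool:
    """Bury the needle in the middle of ~context_chars of filler and ask."""
    half = FILLER * (context_chars // (2 * len(FILLER)))
    document = half + NEEDLE + " " + half
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": document + "\n\n" + QUESTION}],
    )
    return "7481" in (response.choices[0].message.content or "")

for approx_tokens in (8_000, 32_000, 100_000):
    # Rough conversion: ~4 characters per token.
    ok = recalls_needle(approx_tokens * 4)
    print(f"~{approx_tokens} tokens: {'recalled' if ok else 'missed'}")
```

Running such a probe at several sizes gives a crude picture of where recall starts to slip for the model you actually use.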
The Increasing Cost of Large Context
Besides performance issues, overloading the context also increases costs. LLMs are stateless, which means they do not have memory of past conversations. For every message you send, the entire conversation history must be sent back to the model.
Here's a nice illustration by OpenAI of how the context window grows with each turn in a conversation:
As the conversation gets longer, the number of tokens sent as input to the model grows for each new message.
Since API usage is priced per token, longer contexts directly translate to higher costs. Keeping the context concise and relevant is therefore not just good for performance, but also for your budget.
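To see how quickly this adds up, here is a minimal Python sketch that simulates a multi-turn conversation. The 4-characters-per-token approximation and the per-token price are placeholder assumptions rather than any provider's real numbers; the point is that billed input tokens grow with every turn because the entire history is resent.

```python
# Minimal sketch of why long conversations get expensive: the full history
# is resent on every turn, so billed input tokens grow with each request.
# Token counts are approximated as ~4 characters per token, and the price
# below is a hypothetical rate, not a real model's pricing.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # hypothetical $3 per 1M input tokens

history: list[dict[str, str]] = []
total_input_tokens = 0

for turn in range(1, 6):
    user_msg = f"User question #{turn}: " + "some follow-up detail " * 50
    history.append({"role": "user", "content": user_msg})

    # Every request must include the entire history, not just the new message.
    input_tokens = sum(approx_tokens(m["content"]) for m in history)
    total_input_tokens += input_tokens
    print(f"turn {turn}: {input_tokens} input tokens in this request")

    # Pretend the model replied; its answer is also carried forward next turn.
    history.append({"role": "assistant", "content": "model answer " * 80})

print(f"total input tokens billed: {total_input_tokens}")
print(f"approx input cost: ${total_input_tokens * PRICE_PER_INPUT_TOKEN:.4f}")
```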
What Causes Context Bloat
Context can get filled up for several reasons, often without the user realizing it. For coding tasks, one common cause of context bloat is irrelevant rules or instructions: including instructions for a backend task when working on the frontend can confuse the model, as explained by a developer from Cline.
Having too many tools from MCP servers can also quickly increase context size. For example, the popular MCP server Playwright contains 21 tools, each with its own definition and instructions.
With the help of the newly released `/context` command in Claude Code, we can see that a single Playwright MCP server can consume over 11.7k tokens, while the built-in tools consume 11.6k tokens.
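If you want a rough sense of how much context a set of tool definitions costs, you can serialize each tool's schema and count tokens. The sketch below uses tiktoken with two made-up tool definitions as stand-ins; they are not the Playwright MCP server's actual schemas, and providers may serialize tool definitions differently before counting.

```python
# Rough sketch of estimating how much context a set of tool definitions
# consumes: serialize each tool's name, description, and parameter schema,
# then count tokens with tiktoken. The two tools below are made-up stand-ins,
# not the real Playwright MCP schemas.
import json
import tiktoken

TOOLS = [
    {
        "name": "browser_navigate",
        "description": "Navigate the browser to a given URL.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string", "description": "Target URL"}},
            "required": ["url"],
        },
    },
    {
        "name": "browser_click",
        "description": "Click an element identified by a CSS selector.",
        "parameters": {
            "type": "object",
            "properties": {"selector": {"type": "string", "description": "CSS selector"}},
            "required": ["selector"],
        },
    },
]

encoding = tiktoken.get_encoding("cl100k_base")
total = 0
for tool in TOOLS:
    tokens = len(encoding.encode(json.dumps(tool)))
    total += tokens
    print(f"{tool['name']}: ~{tokens} tokens")

print(f"total for {len(TOOLS)} tools: ~{total} tokens")
```

Multiply that per-tool cost by the dozens of tools a few MCP servers can add, and a large slice of the context window disappears before the conversation even starts.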
Another simple cause of context bloat is reusing the same chat session for multiple, unrelated tasks. This clutters the conversation history with irrelevant information from previous tasks. The model must then sift through this old context, which can negatively affect its focus on the current problem.
How to Manage Context Effectively
The key to managing context is understanding what is in it. In tools like Claude Code, you can use the `/context` command to see a breakdown of the current token usage of your context window.
If you see memory files taking up too many tokens, consider removing irrelevant rules from your `CLAUDE.md` or `AGENTS.md` file. Keep it concise and relevant.
Some tools like Claude Code support directory-level rules or memory files. For example, if your repository keeps frontend code inside a `frontend` directory, you can place a `CLAUDE.md` specific to frontend tasks in that directory. It will only be loaded when the model is working on frontend tasks there.
If you see tools from MCP servers taking up too many tokens, be selective about which MCP servers you enable. If an MCP server is not needed for your current task, consider disabling it to free up context space.
Another important tip is to start a new session for each new task, clearing the current context window. This ensures the context contains only relevant information, preventing the model from getting confused by previous, unrelated conversations.
Tools like 16x Eval are designed to help developers and researchers run systematic evaluations of LLMs. You can use them to test how different models perform with different prompts or varying context lengths.
This allows you to understand a model's effective context length and refine your context engineering strategy for your specific needs.