Managing the Context Window

Practical strategies for managing the context window in AI conversations: prioritize key information, summarize long threads, and keep prompts focused.

AI AGENTS

May 14, 2026

2 min

Michel Tricot


Agents fail because they can't decide what to ignore.

Every week I talk to founders building agents, and a common pattern shows up early. They connect Slack, Notion, HubSpot, Jira, maybe an internal database. Each new connection adds more context to the window. The tool surface expands, the context gets richer, and the agents often get worse.

Users have longer conversations, more tools get added, the system prompt grows, and retrieved documents pile up. Then the agent starts hallucinating, missing instructions, and giving confidently wrong answers.

Why Do Agents Get Worse in Production?

Picture a typical agent call. The system prompt takes 2,000 tokens. Tool definitions take another 1,500. Conversation history runs to 15,000. Retrieved documents add 10,000 more. The actual user query is maybe 200 tokens, sitting in the middle of a window of nearly 29,000 tokens, surrounded by stale outputs and verbose JSON full of fields the agent will never use.
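Adding up the figures in this example makes the imbalance concrete. The short sketch below uses only the token counts listed above; the component names are labels chosen for illustration. The listed components sum to 28,700 tokens, and the user query is well under 1% of that.

```python
# Token counts taken from the example call described above.
components = {
    "system_prompt": 2_000,
    "tool_definitions": 1_500,
    "conversation_history": 15_000,
    "retrieved_documents": 10_000,
    "user_query": 200,
}

total = sum(components.values())


def share(name: str) -> float:
    """Return a component's share of the window as a percentage."""
    return 100 * components[name] / total


for name, tokens in components.items():
    print(f"{name:<22}{tokens:>7,} tokens  {share(name):5.1f}%")
print(f"{'total':<22}{total:>7,} tokens")
```

Conversation history alone is a little over half the window in this example, while the query that the call exists to answer is about 0.7%.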

The tokens that matter get buried. In founder conversations, a version of the same complaint keeps coming up: the answer was there, but the agent still missed it.

Context Management Is the Bottleneck

Every token has a cost: financial, computational, and cognitive. Every irrelevant token makes it harder for the model to reason over the relevant ones.

The deeper constraint is attention inside the window. Raw token capacity doesn't solve the problem when the model can't focus on what matters. Teams keep running into a simple failure mode in production conversations. Information can be present and still be hard for the model to use when it's buried inside a long input. A critical fact deep in the window becomes functionally invisible, even when it's technically still in context.

Flooding the window doesn't just waste tokens. In the founder conversations behind this piece, it often seems to make reasoning worse. That degradation compounds when longer sequences also drive up compute cost.

Failure Modes From Unmanaged Context

Unmanaged context produces three failure modes that teams often misdiagnose as model limitations.

Applying the wrong fix makes things worse. The model isn't necessarily hallucinating in any of these cases. Often it's faithfully reasoning over bad inputs or an exhausted budget.

Separating Storage From Presentation

The default pattern in many agent architectures is append and pray. Every tool output gets appended to the conversation, every retrieved document goes into the window, and the full history stays in place. This is like bringing your entire filing cabinet to every meeting, carrying every document you've ever filed and hoping participants can find the relevant page. The filing cabinet should stay in the office. Bring a brief.

This is why the context window is just one layer of context engineering. The window is the presentation layer. Upstream work like data acquisition, shaping, freshness, permissions, and retrieval determines whether the model sees anything useful in the first place.

Build a Per-Call Compilation Step

The fix is separating what the system stores from what the model sees. The system maintains a durable state, including full conversation history, retrieved documents, and complete tool outputs. But the model never needs to see all of it. Before each call, a compilation step assembles a purpose-built context window from that stored state.

It selects what's relevant for this specific operation, compresses what can be summarized, and prioritizes by position, placing critical information at the beginning and end of the window. Everything that doesn't earn its place gets cut.

This separation makes context management workable. You can keep complete history for auditability without forcing the model to ingest it on every call. The compilation step also becomes observable and testable. You can inspect exactly what went into the window for each call, debug failures by looking at the compiled context, and iterate on individual processors without rewriting the whole pipeline. That observability is what turns context management from guesswork into engineering.
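A compilation step along these lines can be sketched in a few functions. This is a minimal illustration under stated assumptions, not a reference implementation: `Item`, `estimate_tokens`, `compress`, and `compile_context` are hypothetical names, relevance scores are assumed to come from an upstream scorer, the four-characters-per-token estimate is a rough stand-in for a real tokenizer, and truncation stands in for real summarization.

```python
from dataclasses import dataclass

MARKER = " [truncated]"


@dataclass
class Item:
    text: str
    relevance: float        # 0..1, assumed to come from an upstream scorer
    critical: bool = False  # must-see content, e.g. the current instruction


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def compress(text: str, max_tokens: int) -> str:
    """Placeholder compression: truncate to the cap. A real system might summarize."""
    limit = max_tokens * 4
    if len(text) <= limit:
        return text
    if limit <= len(MARKER):
        return text[:limit]
    return text[: limit - len(MARKER)].rstrip() + MARKER


def compile_context(items: list[Item], budget: int, per_item_cap: int) -> list[str]:
    """Assemble one call's window from stored state: select, compress, position."""
    # Select: critical items first, then by descending relevance.
    ranked = sorted(items, key=lambda i: (i.critical, i.relevance), reverse=True)
    selected, used = [], 0
    for item in ranked:
        text = compress(item.text, per_item_cap)
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # does not earn its place in this call
        selected.append(Item(text, item.relevance, item.critical))
        used += cost
    # Position: alternate critical items between the start and the end of the
    # window, and put everything else in the middle.
    critical = [i for i in selected if i.critical]
    rest = [i for i in selected if not i.critical]
    head = critical[0::2]
    tail = critical[1::2][::-1]
    return [i.text for i in head + rest + tail]
```

Because the function is pure (stored items in, ordered strings out), its output can be logged per call and covered by ordinary unit tests, which is the observability the section describes.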

How Should You Structure Data Before It Reaches the Window?

All the compression and curation in the world can't fix a problem that starts upstream. Most context entering the window was structured for other consumers, not agents. Raw API responses come with nested JSON, pagination metadata, and dozens of fields the agent doesn't need. Tool outputs are formatted for human debugging. Retrieved documents are often chunked by character count instead of semantic boundary.

Agent-aware data infrastructure pre-structures context before it enters the window. Responses include only the fields an agent needs. Tool outputs are shaped for reasoning rather than debugging. Retrieved context is ranked for the task, with freshness and permissions handled upstream. In practice, in many founder conversations, structuring data before it reaches the window comes up as a higher-leverage fix than compressing it after the fact.
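One simple form of pre-structuring is an allow-list that projects a raw API record down to the fields an agent needs. The sketch below assumes a hypothetical ticket record; the field names, the `AGENT_FIELDS` list, and the helper functions are illustrative and not taken from any specific API.

```python
from typing import Any

# Hypothetical allow-list: the fields an agent needs from a ticket record.
AGENT_FIELDS = ("id", "title", "status", "assignee.name", "updated_at")


def get_path(record: dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if a step is missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def shape_for_agent(record: dict[str, Any], fields=AGENT_FIELDS) -> dict[str, Any]:
    """Flatten a raw record to the allow-listed fields, dropping absent ones."""
    shaped = {}
    for path in fields:
        value = get_path(record, path)
        if value is not None:
            shaped[path.replace(".", "_")] = value
    return shaped


# Illustrative raw response with the kind of extra fields described above.
raw = {
    "id": "TCK-101",
    "title": "Login fails on SSO",
    "status": "open",
    "assignee": {"name": "Dana", "email": "dana@example.com", "avatar_url": "https://example.com/a.png"},
    "updated_at": "2026-05-01T12:00:00Z",
    "_links": {"self": "https://example.com/tickets/TCK-101"},
    "pagination": {"page": 1, "per_page": 50},
}

print(shape_for_agent(raw))
```

The pagination block, links, and unused assignee fields never reach the window, so nothing downstream has to spend tokens or attention on them.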

Treating Context Spend Like a Budget

Most teams can't tell you how many tokens their system prompt consumes, what percentage of context is conversation history versus task-relevant information, or when their agent starts crowding out the task itself. Context rarely gets measured as a first-class budget, and the cost stays hidden until the agent starts failing.

Here’s how to build the budget in five steps:

1. Profile actual usage. Instrument a representative sample of agent calls and measure where tokens are actually being spent, including system prompt, tool definitions, conversation history, retrieved documents, and tool outputs. Many teams are surprised by what dominates. Tool definitions and conversation history alone can consume most of the budget before any task-specific context enters the window.
2. Set explicit allocations. Based on profiling, assign a specific token budget to each component: system prompt, tool definitions, retrieved context, and conversation history. The exact numbers will vary. Setting them is what matters.
3. Assign owners. Every allocation needs someone accountable. Without ownership, budgets drift, and in context engineering, drift means silent degradation.
4. Automate compression triggers. At what token count does conversation history get summarized? When do older tool outputs get pruned? These triggers should fire automatically, not depend on someone remembering.
5. Make observability continuous. Track token spend per component per call. Alert when allocations are exceeded. Review the budget periodically as the agent's capabilities evolve.
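The steps above can be sketched as a small budget table plus two checks. Everything here is an assumption for illustration: the allocation numbers, the team names, the 80% trigger ratio, and the function names `check_budget` and `should_summarize_history` are hypothetical, and the measured usage would come from whatever profiling a team already has in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    limit: int  # token budget for this component
    owner: str  # who is accountable when the component drifts


# Hypothetical allocations and owners; the exact numbers will vary by system.
BUDGET = {
    "system_prompt": Allocation(2_000, "platform-team"),
    "tool_definitions": Allocation(1_500, "integrations-team"),
    "retrieved_context": Allocation(6_000, "retrieval-team"),
    "conversation_history": Allocation(4_000, "agent-team"),
}


def check_budget(usage: dict[str, int], budget=BUDGET) -> list[str]:
    """Return one alert per component that exceeds, or lacks, an allocation."""
    alerts = []
    for name, tokens in usage.items():
        allocation = budget.get(name)
        if allocation is None:
            alerts.append(f"{name}: {tokens} tokens with no allocation or owner")
        elif tokens > allocation.limit:
            over = tokens - allocation.limit
            alerts.append(f"{name}: {over} tokens over budget (owner: {allocation.owner})")
    return alerts


def should_summarize_history(usage: dict[str, int], budget=BUDGET,
                             trigger_ratio: float = 0.8) -> bool:
    """Compression trigger: fire once history reaches a set fraction of its limit."""
    limit = budget["conversation_history"].limit
    return usage.get("conversation_history", 0) >= trigger_ratio * limit


# Measured usage for one call, reusing the figures from the earlier example.
usage = {
    "system_prompt": 2_000,
    "tool_definitions": 1_500,
    "conversation_history": 15_000,
    "retrieved_context": 10_000,
}

for alert in check_budget(usage):
    print(alert)
print("summarize history:", should_summarize_history(usage))
```

Running a check like this on every call, rather than on a sample, is what turns the allocation table from a document into an alert.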

We're still figuring out what good context quality looks like in production. But one thing already seems clear from the founder patterns behind this piece: teams that treat context as a managed budget tend to catch degradation before users do. Teams that don't find out from support tickets.

The Question That Changes How You Build Agents

The teams that build reliable agents focus on what should stay out of the context window, and who decides.

The context window is a working surface where every token has a financial cost, a computational cost, and a cognitive cost. Treating it like a budget, with allocations, owners, accountability, and the willingness to cut, is what separates agents that demo well from agents that hold up in production. In many of the production patterns I hear about, lower cost, lower latency, and better accuracy all follow from managing the budget. The teams building that discipline now won't just ship better agents. They'll be operating with a mature practice while the rest of the industry is still debugging context windows by trial and error.
