Jun 6, 2026 · 11 min read AI Platform Engineering Architecture

Context engineering: the window is a budget, not a bucket

The context window is your agent's working memory, not a junk drawer. Four operations — write, select, compress, isolate — and a token budget you allocate.

Prompt engineering asks: what should I say to the model? Context engineering asks the harder question: of everything I could put in front of the model right now, what earns a place in the window?

That second question is where almost every production agent quietly succeeds or fails. I’ve now spent a year building agentic infrastructure for a safety-critical platform, and the pattern is consistent: the model is rarely the bottleneck, the wording is rarely the bottleneck — the assembly of context is. Two agents running the identical model and the identical system prompt will behave like different products if one of them packs its window well and the other dumps everything it can find into it.

I framed this once before, in my harness engineering post, as the middle phase of AI maturity:

2022–2023: prompt engineering — get the wording right.
2024–2025: context engineering — get the right information into the window.
2026: harness engineering — engineer the whole environment the model runs in.

That post was about the outer loop. This one zooms into the part I waved at — the context window itself — because it’s the component teams most consistently underinvest in, and the one with the worst failure modes when they do.

The window is RAM, not a hard drive

Here’s the mental model that fixed this for me, and it’s one any infrastructure person already owns.

The context window is RAM. It is small, fast, expensive, and volatile. Everything the model “knows” in this turn — the system prompt, the conversation, retrieved documents, tool results, memory — has to be resident in that RAM at inference time. Nothing else exists to the model. The vector store, the wiki, the database, the previous conversation: all of that is disk. It is only real to the model the instant you page it into the window.

Context engineering is memory management. You have a fixed budget of fast memory, an effectively unbounded backing store, and a workload that needs the right pages resident at the right moment. The whole discipline is paging strategy: what to load, when to load it, what to evict, and how to keep the working set small enough to be fast and cheap without thrashing.

Once you see the window as a managed budget rather than a bucket you fill, the rest of this post is just the policies.

Four operations

The community has more or less converged on four things you can do with context. I’ll use the now-common framing — write, select, compress, isolate — because it’s a good carve-up, but the treatment here is the one I’ve landed on in production.

1. Write — get it out of the window

The first instinct of most agents is to keep everything in the conversation. That’s the equivalent of never writing to disk and wondering why you ran out of RAM.

Write means externalizing state the model doesn’t need this turn so it can pull it back when it does:

A scratchpad — the agent writes its plan, intermediate findings, and decisions to a structured note it can re-read, instead of carrying the whole reasoning trace in-window.
Memory — facts that should survive across turns or sessions (the cluster’s topology, a user’s preferences, what was already ruled out) live in a store, not in the transcript.

The seam that matters: the agent depends on write(fact) and read(query), not on the transcript. This is the same instinct as the compacting memory in my SRE harness — working memory holds the task plus observations, and everything else is paged out.
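The write(fact) / read(query) seam can be sketched in a few lines. This is a minimal illustration, not the interface of any particular library: the MemoryStore class is hypothetical, and the keyword-overlap lookup is an assumption standing in for whatever retrieval a real store would use.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical external store: facts live here, not in the transcript."""
    facts: list[str] = field(default_factory=list)

    def write(self, fact: str) -> None:
        # Externalize a fact so it no longer has to stay resident in the window.
        if fact not in self.facts:
            self.facts.append(fact)

    def read(self, query: str, limit: int = 3) -> list[str]:
        # Naive keyword overlap, standing in for real retrieval.
        terms = set(query.lower().split())
        scored = [(len(terms & set(f.lower().split())), f) for f in self.facts]
        hits = [(score, fact) for score, fact in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in hits[:limit]]


memory = MemoryStore()
memory.write("ruled out: networking")
memory.write("cluster topology: three zones")
memory.write("current hypothesis: undersized heap")

# Only the matching fact is paged back in; the rest stays on "disk".
recalled = memory.read("what was ruled out")
print(recalled)
```

The agent code only ever sees write and read, so the store behind them can change without touching the prompt assembly.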

2. Select — page in only what this turn needs

Selection is retrieval, and it’s where most of the leverage and most of the foot-guns live. The naive version — embed the query, pull the top 50 chunks, paste them in — is the single most common way I see context budgets blown.

Good selection is just-in-time and ranked:

Retrieve a candidate set, then rerank and keep the few that actually earn their tokens. Top-5 reranked beats top-50 raw on both quality and cost, every time I’ve measured it.
Prefer fetching tool results when needed over preloading them speculatively. An agent that can call get_logs doesn’t need yesterday’s logs sitting in its window on the chance it asks.
Select tools too, not just documents. If your agent has forty tools, don’t advertise forty — the model’s accuracy at picking one degrades as the menu grows. This is the catalog discipline from the MCP gateway pattern applied to context: a curated, governed list beats raw enumeration.

If you’re doing retrieval at all, the RAG fundamentals and vector embeddings posts are the prerequisites; context engineering is what you do after retrieval returns more than you can afford to use.
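The retrieve-then-rerank step might look like the sketch below. The lexical_score function is a toy stand-in for a real reranker (a cross-encoder, for instance), and the candidate strings are made up for illustration; only the shape, a candidate set in and a short ranked list out, is the point.

```python
def lexical_score(query: str, chunk: str) -> float:
    # Fraction of query terms present in the chunk; a stand-in for a real reranker.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def select(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Rerank the whole candidate set, then keep only the few that earn their tokens.
    ranked = sorted(candidates, key=lambda c: lexical_score(query, c), reverse=True)
    return ranked[:keep]


# Hypothetical candidate chunks returned by a first-stage retriever.
candidates = [
    "rotating tls certificates for the ingress controller",
    "pod stuck in crashloopbackoff after an oom kill",
    "how to resize a persistent volume claim",
    "crashloopbackoff caused by a failing readiness probe",
]
top = select("pod crashloopbackoff oom", candidates, keep=2)
print(top)
```

The keep parameter is where the budget shows up: it caps what selection may hand to the window regardless of how many candidates retrieval returns.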

3. Compress — make the resident set smaller

When the working set genuinely has to stay large — a long incident thread, a multi-hour agent run — you compress rather than evict.

Summarize completed sub-tasks into a few lines of outcome and carry the summary, not the transcript. My incident-scribe tool is this operation as a product: a messy Slack thread compressed into a structured report.
Compact on a trigger — when the window crosses some fraction of the limit, roll the oldest turns into a running summary. The key discipline is that compaction is lossy on purpose, so you decide what’s load-bearing (decisions, ruled-out hypotheses, current hypothesis) and what’s disposable (the play-by-play that got you there).
Don’t compress what you can just evict. A tool result you’ve already acted on can often leave the window entirely, with a one-line pointer left behind.
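Compaction on a trigger can be sketched as follows. Everything here is an assumption made for illustration: the whitespace token count, the 0.8 trigger fraction, and the convention that load-bearing turns start with a prefix such as "ruled out:". In a real agent the summary would more likely come from a model call than from a prefix filter.

```python
def count_tokens(text: str) -> int:
    # Crude whitespace count; an assumption standing in for a real tokenizer.
    return len(text.split())


def compact(turns: list[str], summary: str, limit: int,
            trigger: float = 0.8, keep_recent: int = 2) -> tuple[list[str], str]:
    """Roll the oldest turns into the running summary once usage crosses the trigger."""
    used = count_tokens(summary) + sum(count_tokens(t) for t in turns)
    if used <= limit * trigger or len(turns) <= keep_recent:
        return turns, summary
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Lossy on purpose: keep only turns marked as load-bearing, drop the play-by-play.
    kept = [t for t in old if t.startswith(("decision:", "ruled out:", "hypothesis:"))]
    new_summary = " | ".join(filter(None, [summary, *kept]))
    return recent, new_summary


turns = [
    "checked pod events and saw repeated restarts",
    "ruled out: networking",
    "looked at image pull logs, nothing unusual there",
    "ruled out: image pull",
    "hypothesis: undersized heap",
    "memory at 97% with three OOM kills",
]
recent, summary = compact(turns, summary="", limit=30)
print(summary)
print(recent)
```

Deciding the load-bearing markers up front is the design choice that matters; the mechanics of rolling turns into a summary are the easy part.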

4. Isolate — give each job its own window

The most underused operation. Instead of one agent with one ever-growing context, split the work across isolated contexts — sub-agents, each with a clean window scoped to one job, reporting back only a result.

A triage agent spawns a “read the logs and summarize the error” sub-agent. That sub-agent burns whatever tokens it needs on raw log lines in its own window, and returns three sentences. The parent never sees the log spew — only the conclusion. Isolation is how you keep the orchestrator’s window clean while still doing token-heavy work underneath it.

The cost is coordination and the risk is context fragmentation — sub-agents that each lack a fact the others have. Use it when the sub-task is genuinely separable and its intermediate state is noise to the parent.
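The isolation boundary is easiest to see in code. In this sketch, run_subagent and summarize_logs are hypothetical names, and the string matching stands in for the sub-agent's own model call; the raw lines exist only inside the sub-agent's scope and only the conclusion is returned.

```python
def summarize_logs(log_lines: list[str]) -> str:
    # Stand-in for the sub-agent's model call over the raw lines.
    errors = [line for line in log_lines if "Error" in line]
    if not errors:
        return "No errors found in logs."
    return f"{len(errors)} error line(s); last: {errors[-1]}"


def run_subagent(task: str, raw_input: list[str]) -> str:
    # The sub-agent's window is local to this call and discarded on return.
    sub_window = [task, *raw_input]
    result = summarize_logs(sub_window[1:])
    return result  # only the conclusion crosses the boundary


parent_window = ["task: find why checkout-7f9 is crash-looping"]

# Synthetic log spew for the demo: 2,000 noise lines and one error.
logs = [f"INFO request {i} ok" for i in range(2000)]
logs.append("java.lang.OutOfMemoryError: Java heap space")

parent_window.append(run_subagent("read the logs and summarize the error", logs))
print(parent_window)
```

The parent's window grows by one short string, however many tokens the sub-agent spent getting there.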

Allocate the budget like you mean it

Here’s the part teams skip: actually writing down the budget. If your window is 200K tokens, that is not 200K tokens of freedom — it’s a budget with line items, and the useful exercise is filling in the table before you’re debugging why the agent went sideways at turn 30.

A real allocation for an incident-triage agent might look like:

Segment	Budget	Policy
System prompt + tool catalog	~4K	Static. Cache it.
Task + current hypothesis	~1K	Always resident.
Retrieved runbook / docs	~8K	Top-5 reranked, never raw top-N.
Tool results (this turn)	~10K	Just-in-time; evicted after acted on.
Working memory / observations	~6K	Compacted past a threshold.
Headroom for the model’s reply	~8K	Reserved. Don’t spend it.

The exact numbers matter less than the act of allocating. A segment without a budget is a segment that will, under load, expand to consume the whole window — and the thing it starves is usually the model’s own room to think.
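Written as code, the table above becomes a list of typed segments with caps. This is a minimal sketch under stated assumptions: Segment and assemble are hypothetical names, tokens are approximated by whitespace-separated words, and truncation is used as the bluntest possible enforcement policy. The filler text is synthetic.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    budget: int      # token cap for this line item
    text: str = ""


def count_tokens(text: str) -> int:
    # Whitespace count; an assumption standing in for a real tokenizer.
    return len(text.split())


def assemble(segments: list[Segment], window_limit: int) -> list[tuple[str, int, int, str]]:
    """Return (name, used, budget, status) rows, truncating any segment over its cap."""
    if sum(s.budget for s in segments) > window_limit:
        raise ValueError("budget line items exceed the window")
    rows = []
    for seg in segments:
        words = seg.text.split()
        status = "ok"
        if len(words) > seg.budget:
            seg.text = " ".join(words[: seg.budget])
            status = "truncated"
        rows.append((seg.name, count_tokens(seg.text), seg.budget, status))
    return rows


segments = [
    Segment("system_and_tools", 4_000, "static " * 3_500),
    Segment("task_and_hypothesis", 1_000, "Pod checkout-7f9 is crash-looping in prod. Find the cause."),
    Segment("retrieved_docs", 8_000, "runbook " * 6_000),
    Segment("tool_results", 10_000, "logline " * 12_000),
    Segment("working_memory", 6_000, "observation " * 2_000),
    Segment("reply_headroom", 8_000),  # reserved: stays empty by design
]
rows = assemble(segments, window_limit=200_000)
for name, used, budget, status in rows:
    print(f"{name:<22}{used:>7}{budget:>7}  {status}")
```

The "truncated" status is the useful output. It tells you which segment tried to outgrow its line item, which is the signal to fix selection or compaction upstream rather than to raise the cap.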

Two infrastructure levers make the budget cheaper to hold:

Prompt caching. The static head of your window — system prompt, tool catalog, stable instructions — should be cached so you’re not re-paying for it every turn. This is pure context engineering at the billing layer; see prompt caching.
Ordering for cache hits. Put the stable stuff first and the volatile stuff last, so the cacheable prefix is as long as possible.
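A small sketch of why ordering matters, assuming the cache matches on an exact shared prefix between consecutive requests (the details vary by provider, so check yours). The segment tuples and helper names are hypothetical.

```python
def order_for_cache(segments: list[tuple[str, str, bool]]) -> list[tuple[str, str, bool]]:
    # Each segment is (name, text, is_stable). Stable segments sort first;
    # sorted() is stable, so relative order inside each group is preserved.
    return sorted(segments, key=lambda seg: not seg[2])


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    # Number of leading segments identical across two turns: the cacheable prefix.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def texts(segments: list[tuple[str, str, bool]]) -> list[str]:
    return [text for _, text, _ in segments]


turn_1 = [
    ("tool_results", "list_pods: 14 pods", False),
    ("system", "You are an SRE triage agent.", True),
    ("tools", "get_logs, get_pod_metrics, list_pods", True),
]
turn_2 = [
    ("tool_results", "get_pod_metrics: memory 97%", False),
    ("system", "You are an SRE triage agent.", True),
    ("tools", "get_logs, get_pod_metrics, list_pods", True),
]

unordered_hit = shared_prefix_len(texts(turn_1), texts(turn_2))
ordered_hit = shared_prefix_len(texts(order_for_cache(turn_1)), texts(order_for_cache(turn_2)))
print(unordered_hit, ordered_hit)
```

With the volatile tool result first, nothing is shared between the two turns; with the stable head first, the system prompt and tool catalog form a reusable prefix.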

The failure modes that show up at scale

These are the symptoms that send teams back to this post. None of them are model failures. All of them are context failures.

Context rot / lost-in-the-middle. Past a certain fill, models attend worse to the middle of a long window. The fact that decides the answer is in there; the model just glides over it. Symptom: the agent “ignores” something you can see is right there. Fix: less context, reordered so the deciding facts sit near the start or end of the window, not more of it.
Distraction. Irrelevant-but-plausible context pulls the model off-task. Fifty retrieved chunks where five were relevant don’t just waste tokens — the other forty-five actively degrade the answer. Fix: rerank hard, keep few.
Poisoning. A wrong fact enters the context — a bad retrieval, a hallucinated tool result echoed back in — and then persists, compounding every subsequent turn because the model treats its own context as ground truth. Fix: provenance on context (where did this come from), and a verification pass before acting. This is exactly why my harness makes evidence a structural requirement, not a hope.
Clash. Two retrieved chunks contradict each other and the model picks arbitrarily. Fix: dedup and reconcile at selection time, before it reaches the window.
Cost and latency blowup. A window that grows unbounded turns every turn slower and more expensive — and the user feels both. Fix: the budget table above, enforced.
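For the clash case, reconciliation at selection time can start very simply. The sketch below assumes retrieved content has already been reduced to (key, value, source) facts, which is a strong simplification; the reconcile function and the sample facts are hypothetical.

```python
from collections import defaultdict


def reconcile(facts: list[tuple[str, str, str]]) -> tuple[dict[str, str], dict[str, set[str]]]:
    """Split (key, value, source) facts into agreed values and conflicting ones."""
    by_key: dict[str, set[str]] = defaultdict(set)
    for key, value, _source in facts:
        by_key[key].add(value)  # the set drops exact duplicates
    agreed = {k: next(iter(v)) for k, v in by_key.items() if len(v) == 1}
    conflicts = {k: v for k, v in by_key.items() if len(v) > 1}
    return agreed, conflicts


# Made-up facts for illustration only.
facts = [
    ("memory_limit", "512Mi", "runbook"),
    ("memory_limit", "512Mi", "deployment.yaml"),
    ("replicas", "3", "wiki"),
    ("replicas", "5", "deployment.yaml"),
]
agreed, conflicts = reconcile(facts)
print(agreed)
print(conflicts)
```

Agreed facts go into the window once. Conflicts are held back for an explicit decision (prefer the live source, or ask a tool) instead of letting the model pick arbitrarily.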

The through-line: a bigger window is not a fix for any of these. It’s an enabler for all of them. The discipline scales down, not up.

A worked example: triage in one window

Concretely, here’s the window for an SRE agent triaging a CrashLoopBackOff, after context engineering:

Cached head: system prompt + the seven tools this run is allowed (not all forty). ~4K, paid once.
Task: “Pod checkout-7f9 is crash-looping in prod. Find the cause.” Plus the current hypothesis, updated in place each turn — not appended.
Selected context: the one runbook section that matched CrashLoopBackOff, reranked out of a dozen candidates. The other eleven never enter the window.
Just-in-time tool results: get_pod_metrics returns memory at 97% with three OOM kills. That stays resident because it’s load-bearing. The earlier list_pods output, already acted on, is evicted to a one-line pointer.
Isolated sub-work: reading 2,000 lines of raw logs happens in a sub-agent’s window. The parent receives a three-sentence summary whose key finding is "OutOfMemoryError: Java heap space against a 512Mi limit", not the spew.
Compaction: by turn 12 the early investigation is rolled into “ruled out: networking, image pull, config. Current: undersized heap.” The play-by-play is gone; the decisions survive.

A naive agent with the same model and the same tools would have pasted all forty tools, twelve runbook chunks, and 2,000 log lines into one window and then “mysteriously” lost the thread around the memory metric. The difference is entirely paging strategy.
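The eviction step in the example, where list_pods output leaves the window and a one-line pointer stays behind, might be sketched like this. The dict-based window and store, the pointer format, and the read('...') hint are all assumptions for illustration, and the pod listing is synthetic.

```python
def evict_to_pointer(window: dict[str, str], store: dict[str, str], key: str) -> None:
    # Move the full payload to the backing store and leave a one-line pointer behind.
    payload = window[key]
    store[key] = payload
    window[key] = f"[evicted {key}: {len(payload.split())} tokens, reload with read('{key}')]"


window = {
    "task": "Pod checkout-7f9 is crash-looping in prod. Find the cause.",
    "list_pods": "checkout-7f9 CrashLoopBackOff\n" + "other-pod Running\n" * 200,
    "get_pod_metrics": "memory at 97% with three OOM kills",
}
store: dict[str, str] = {}

# list_pods has been acted on, so it leaves; the load-bearing metrics stay resident.
evict_to_pointer(window, store, "list_pods")
print(window["list_pods"])
```

The pointer keeps the fact that the call happened, and how to get the payload back, without paying for the payload on every turn.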

Try it

I packaged the four operations and the budget into a small, runnable reference: context-assembler. It’s offline, dependency-free, and Apache-2.0 — typed segments, an enforced token budget, a reranker, and compaction, in code you can read in a sitting.

git clone https://github.com/ajinb/context-assembler.git
cd context-assembler
pip install -e ".[dev]"
python examples/triage_demo.py

The demo assembles the exact triage above two ways and prints the allocation table for each. The naive window dumps forty tools, twelve raw runbook chunks, and 2,000 log lines, runs at 93%, and truncates the logs — silently dropping the OutOfMemoryError line that sits at the tail and explains the crash:

  logs   get_logs-raw   3500   3500   truncated
  → load-bearing evidence ('OutOfMemoryError'): ✗ MISSING in the assembled window.

The engineered window — reranked to one runbook section, the log spew isolated into a three-sentence sub-agent summary, observations compacted — fits at 5% with the evidence resident:

  → load-bearing evidence ('OutOfMemoryError'): ✓ resident in the assembled window.

Same model, same facts. The only variable is how the window was packed. Swap the deterministic token counter for a real tokenizer and the lexical reranker for a cross-encoder; the shapes don’t change. It’s a reference to read and outgrow, not a framework to depend on.

What I’d build first

Standing this up at a new org, in order:

A context assembler with a budget table. One function that builds the window from typed segments, each with a token cap. Before anything clever, just make the budget visible and enforced. You can’t manage what you don’t measure.
Reranked selection. Put a reranker between retrieval and the window. This is the single highest-leverage change for most RAG-backed agents.
Prompt caching on the static head. Lower latency and lower cost, with no quality cost. Order the window for it.
Compaction on a trigger. Before you have long runs, decide what’s load-bearing so the summary keeps the right things.
Context provenance. Tag each segment with where it came from. The day you debug a poisoning bug, you’ll want it — same reason the MCP gateway keeps a provenance ledger.

That order keeps the agent shippable at every step and avoids the dead-end where you build an elaborate retrieval pipeline and never look at what actually lands in the window.
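Provenance is the cheapest item on that list to prototype. A minimal sketch, assuming each piece of context is wrapped in a small record: ContextItem, the source labels, the get_pod_spec tool name, and the sample claims are hypothetical and exist only to show the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    text: str
    source: str             # where this came from, e.g. "tool:get_pod_metrics"
    verified: bool = False  # set by a verification pass before the agent acts


def sources_of(window: list[ContextItem], needle: str) -> list[str]:
    # Debugging aid: which sources contributed text containing this claim?
    return [item.source for item in window if needle in item.text]


def unverified(window: list[ContextItem]) -> list[ContextItem]:
    # Items the agent should not act on until they have been checked.
    return [item for item in window if not item.verified]


# Made-up window contents for illustration only.
window = [
    ContextItem("memory at 97% with three OOM kills", "tool:get_pod_metrics", verified=True),
    ContextItem("memory limit is 2Gi", "retrieval:stale-wiki-page"),
    ContextItem("memory limit is 512Mi", "tool:get_pod_spec", verified=True),
]

print(sources_of(window, "2Gi"))
print([item.text for item in unverified(window)])
```

When a wrong fact shows up in the agent's reasoning, sources_of answers "who put this here" in one call, which is most of the work in a poisoning investigation.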

The window is a designed artifact

The reason context engineering sits between prompt and harness engineering isn’t chronology — it’s altitude. Prompt engineering is the words. Harness engineering is the environment. Context engineering is the thing in the middle that both depend on: the working set the model reasons over, assembled fresh every turn.

Treat that working set the way you’d treat memory in any system you’re proud of — budgeted, observable, with a clear eviction policy and provenance on every page. Do that and the same model you already have starts behaving like a better one. Skip it, and no amount of prompt-tuning or model upgrades will save an agent that’s drowning in its own context.

The window isn’t a bucket you fill. It’s a budget you allocate. Spend it like it’s scarce, because it is.

Ajin Baby is an AI Platform & Cloud Infrastructure Architect at a Fortune-500 aviation technology company and the founder of cloudandsre.com, where he publishes production-grade tooling at the intersection of AI and SRE. He is currently pursuing an MS in Artificial Intelligence and is a 15+ year IEEE member.