What is prompt caching?

A short primer on prompt caching — the LLM-provider feature that drops the cost of a repeated long prompt by 50–90% and the latency by half. Covers how prefix matching works, the TTL economics across Anthropic / OpenAI / Google, where caching helps and where it quietly does not, and the operational gotchas that determine whether your hit rate is 90% or 9%.


Network cables in a server rack — the cache lives on the GPU, not on the wire. Photo: Taylor Vick on Unsplash.

The bill at the end of a month running an LLM-backed product almost never matches the bill the team estimated at the start. The single biggest reason is that the team modelled cost as tokens × price-per-token and forgot that, in practice, the same long preamble is being sent on every single call. Prompt caching is the provider-side feature that fixes that — and the operational discipline that decides whether it does so by 90% or by 9%.

This post is about what prompt caching actually is, how it works under the hood at the major providers, and the surprisingly small set of decisions that determine whether your hit rate is good or bad.

The shape of the bill

A typical production LLM call has three parts:

  • The system prompt and tool definitions. Stable across calls. Often 1,000–10,000 tokens for a real agent. Written once, evolves slowly.
  • The retrieved context (RAG, conversation history). Partially stable. The conversation history grows; the documents pulled by retrieval change per query but often share a corpus-level prefix.
  • The user’s current turn. Changes every call.

Without caching, every one of those tokens is read by the model on every call, and the provider charges for every token of input. A chatbot that ships with a 5,000-token system prompt and serves 10,000 calls a day is paying to re-read those 5,000 tokens 10,000 times — 50 million input tokens per day from the system prompt alone, before a single user has typed a word.

Prompt caching flips that math. The provider stores the model’s internal representation of a prefix of your prompt; on the next call that starts with the same prefix, the cached representation is reused instead of recomputed. You are charged a lower price per cached token, and you get the response back faster because the model skips the prefill compute on the cached portion.

How it works

The provider hashes the prompt prefix, stores the keys-and-values from the attention layers (the KV cache) keyed by that hash, and on the next call matches your prefix against the store. A match returns the cached KV; a miss recomputes and stores it for next time.
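As a mental model, the lookup is a longest-prefix match over hashed token blocks. The toy sketch below is just that mental model, not any provider's implementation; the block size is illustrative and the KV state is reduced to an opaque placeholder.

```python
import hashlib

BLOCK = 1024  # providers only cache at coarse boundaries; 1,024 tokens is a typical minimum

class PrefixKVCache:
    """Toy model of a provider-side prefix cache: hash the prompt's leading
    token blocks and store the computed attention KV state per boundary."""

    def __init__(self):
        self.store = {}  # prefix hash -> opaque KV state for that prefix

    def _key(self, tokens):
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def lookup(self, tokens):
        """Return (kv_state, n_cached_tokens) for the longest cached prefix."""
        best = (None, 0)
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = self._key(tokens[:end])
            if key in self.store:
                best = (self.store[key], end)
        return best

    def insert(self, tokens, kv_state):
        """After a miss, store the freshly computed KV state at each boundary."""
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            # a real cache slices the actual KV tensors; a string stands in here
            self.store[self._key(tokens[:end])] = f"{kv_state}:{end}"
```

The block granularity is why the minimum chunk size below exists at all: a prefix shorter than one block never enters the store.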

Three properties decide whether the cache is useful, and they are the same across every provider:

  • Prefix matching, not substring matching. The cache is matched from the beginning of the prompt. If you change the first token, the entire cache misses. Anything dynamic — a timestamp, a user ID — must go at the end of the prompt, not at the beginning.
  • A minimum chunk size. Providers will not cache short prefixes; the bookkeeping is more expensive than the savings. The threshold is 1,024 tokens at OpenAI and at Anthropic for most Claude models (the smaller Haiku models require 2,048). Below that, caching is a no-op.
  • A short TTL. The cache is retained for minutes, not days — because the storage is the GPU’s HBM, not S3. The cache lives where the model lives, and that memory is expensive. Treat the TTL as a real constraint, not as a bug to be worked around.

The two flavors — explicit (Anthropic) vs automatic (OpenAI)

The mechanism is the same. The ergonomics, the depth of the discount, and the write pricing are different enough that conflating them produces wrong cost models.

Anthropic — explicit, deeper discount, longer TTL options.

You mark a cache breakpoint with a cache_control block in the SDK. The cached prefix is stored for 5 minutes by default, with a 1-hour extended tier available. Pricing as of 2026:

  • 5-minute cache write: 1.25× base input price (you pay 25% more on the first call to write)
  • 1-hour cache write: 2× base input price
  • Cache read: 0.1× base input price (10% of normal — the deep discount)

Break-even is fast: one cache read pays back the 5-minute write; two reads pay back the 1-hour write.
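A minimal sketch of the explicit flavor, assuming the official anthropic Python SDK. The model ID and the prompt file path are placeholders, and the 1-hour TTL syntax may require a newer SDK version or a beta header depending on your account.

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = open("system_prompt.txt").read()   # stable, >= 1,024 tokens

def ask(user_turn: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder; any cache-capable Claude model
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # "cache everything up to and including this block";
            # the 1-hour tier is {"type": "ephemeral", "ttl": "1h"}
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_turn}],
    )
    # first call: cache_creation_input_tokens > 0; warm calls: cache_read_input_tokens > 0
    print(response.usage.cache_creation_input_tokens,
          response.usage.cache_read_input_tokens)
    return response.content[0].text
```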

OpenAI — automatic, no markers, smaller discount.

You do nothing. Prompts longer than 1,024 tokens are cached automatically; cached input is billed at 0.5× base input price (50% off). There are no breakpoints, no cache_control markers, no extended-TTL tier. The provider’s TTL is “best effort” — Microsoft Foundry documents 5–10 minutes typical, longer for high-traffic prompts.
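The automatic flavor looks like an ordinary call, assuming the official openai Python SDK; the only work is keeping the stable content first. Model ID and prompt file path are placeholders.

```python
from openai import OpenAI

client = OpenAI()

LONG_SYSTEM_PROMPT = open("system_prompt.txt").read()   # stable, >= 1,024 tokens

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder; any model with automatic caching
    messages=[
        # stable prefix first, dynamic turn last -- that is the whole trick
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "What does clause 7 mean for our renewal?"},
    ],
)

# non-zero on a hit; zero on a miss or when the prompt is under 1,024 tokens
print(response.usage.prompt_tokens_details.cached_tokens)
```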

Google Gemini sits between the two: explicit cached-content objects with TTLs you set, and cache reads at roughly 0.25× the base input price on Gemini 2.5 Pro / Flash.

A> Why this matters for your cost model. A team migrating an Anthropic-cached workload to OpenAI without changing anything else will see unit economics get worse, not better — because the OpenAI discount is 50% off, not 90% off. Conversely, a team coming the other way and keeping OpenAI’s “do nothing” mental model on Anthropic will leave most of the savings on the table by never marking breakpoints. The mechanism is the same. The numbers and the work required are not.

What the cost actually looks like

A worked example for a 5,000-token system prompt served 10,000 times per day at 90% cache hit rate, on Anthropic with the 5-minute tier:

  • Without caching: 50,000,000 input tokens × base price.
  • With caching: 45,000,000 read tokens × 0.1× + 5,000,000 miss tokens × 1.25× (a miss re-writes the cache at the 5-minute write price).
  • Net: roughly 20% of the original cost, plus a similar reduction in time-to-first-token.

Same workload on OpenAI with automatic caching:

  • 5,000,000 miss tokens × 1× + 45,000,000 cached tokens × 0.5×.
  • Net: roughly 55% of the original cost.

Different mechanism, same direction; the magnitude depends on the provider you actually use. The 90% hit rate, however, is not automatic on either platform. It is the result of a small set of design choices.
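The same arithmetic in a few lines, with prices expressed relative to the base input rate (1.0) so it holds for any model; the Anthropic side assumes every miss re-writes the cache at the 5-minute write price.

```python
calls_per_day = 10_000
prompt_tokens = 5_000
hit_rate      = 0.9

baseline   = calls_per_day * prompt_tokens     # 50,000,000 token-equivalents
hit_calls  = int(calls_per_day * hit_rate)     # 9,000
miss_calls = calls_per_day - hit_calls         # 1,000

hits   = hit_calls * prompt_tokens             # 45,000,000 cached-read tokens
misses = miss_calls * prompt_tokens            #  5,000,000 uncached tokens

anthropic_cost = hits * 0.10 + misses * 1.25   # reads at 0.1x, misses re-write at 1.25x
openai_cost    = hits * 0.50 + misses * 1.00   # reads at 0.5x, misses at full price

print(anthropic_cost / baseline)   # ~0.22 -> roughly 20% of the uncached bill
print(openai_cost / baseline)      # 0.55  -> roughly 55%
```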

The hit-rate disciplines

Four disciplines decide whether your cache hit rate clears 80%.

Stable-prefix discipline. Everything that does not change per call goes at the front of the prompt, in a fixed order. System prompt, then tool definitions, then domain context, then the conversation history, then — at the very end — the new user turn. A team that re-orders the system prompt and the tool definitions between releases is silently invalidating every cached prefix in production.

No timestamps in the prefix. “The current time is 2026-05-15T14:32:01Z” at the top of the system prompt is the single most common cache-killer in production deployments. Either move it to the end of the prompt, or accept that a per-second-precision timestamp creates a unique prefix for every single call.
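A sketch of both disciplines at once: fixed assembly order, append-only history, and the clock pushed into the final user turn. The function name, constant, and file path are illustrative.

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = open("system_prompt.txt").read()   # stable preamble, placeholder path

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    """Assemble the prompt so everything stable sits in front of everything dynamic."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return [
        # 1. stable preamble: always first, always in the same order
        {"role": "system", "content": SYSTEM_PROMPT},
        # 2. conversation history: append-only, so earlier turns remain a stable prefix
        *history,
        # 3. everything per-call, including the clock, goes last
        {"role": "user", "content": f"{user_turn}\n\n(current time: {now})"},
    ]
```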

Cache breakpoints, used deliberately. Anthropic lets you mark explicit breakpoints — “cache the prompt up to here.” Use them to capture the largest stable prefix you can, and place dynamic-but-shared content (a retrieved document set that is reused for several turns) under its own breakpoint so it caches independently. OpenAI does this for you automatically; the effective breakpoint is simply wherever the longest prefix match against the store ends.

Warm the cache where it matters. A 5-minute TTL is fine for high-traffic conversations. For lower-traffic agents — say, an on-call assistant that fires only when an alert comes in — the cache is cold every time. The fix is a periodic synthetic call that hits the prefix and resets the TTL clock; the cache write cost is a fraction of what you save on the next real call.
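A sketch of such a warmer, reusing the Anthropic constants from the earlier example. The interval and the bare scheduling loop are illustrative (a cron job or scheduler works just as well), and the warm call must send a byte-identical prefix or it warms nothing.

```python
import time
import anthropic

client = anthropic.Anthropic()
WARM_INTERVAL_S = 4 * 60   # refresh just inside the 5-minute TTL

def warm_prefix() -> None:
    """Touch the cached prefix so its TTL clock resets before it expires."""
    client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder; must match the real calls
        max_tokens=1,                        # we only need the prefill, not an answer
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,      # byte-identical to the production prefix
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "ping"}],
    )

while True:
    warm_prefix()
    time.sleep(WARM_INTERVAL_S)
```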

Where it does not help

Prompt caching is not a universal optimization. The places it gives you nothing:

  • Heavily personalized prompts. If the prefix is unique per user (the user’s profile interpolated near the top), the cache misses every time.
  • Small prompts with small responses. If your call sends 200 tokens of input and gets 50 tokens of output, the input is below the caching minimum and any savings would be dwarfed by the output cost anyway. Cache where the input is large.
  • Multi-tenant systems with no shared prefix. A B2B platform running 1,000 tenants, each with their own system prompt, gets one cached prefix per tenant. If the per-tenant traffic is low, hit rates collapse. Pull as much as possible into a shared base prefix and put tenant-specific content after a cache breakpoint.
  • Cross-call ordering changes. A retrieval system that returns the top-5 chunks in similarity-rank order will reorder them between two near-identical queries. The reordered chunks change the prompt prefix and miss the cache. Sort retrieved chunks by a stable key (chunk ID, source path) before insertion, as sketched below.
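The fix for that last point is a one-liner at prompt-assembly time; the field names here are illustrative.

```python
def render_chunks(chunks: list[dict]) -> str:
    """Render retrieved chunks in a deterministic order so near-identical
    queries produce an identical prompt prefix."""
    # rank by similarity for selection, but sort by a stable key for insertion
    stable = sorted(chunks, key=lambda c: c["chunk_id"])
    return "\n\n".join(f"[{c['chunk_id']}] {c['text']}" for c in stable)
```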

What prompt caching is not

  • Not response caching. Response caching returns a stored answer for a repeated question and skips the model entirely. Prompt caching still runs the model — it just reuses the model’s internal state for the part of the prompt that did not change. The two compose.
  • Not a privacy boundary. A cached prefix at a provider is your cached prefix. It is not shared with other customers, but it is also not encrypted at rest in any way you control. Treat the cache like any other provider-side state.
  • Not a substitute for prompt engineering. A 10,000-token system prompt that caches well is still a 10,000-token system prompt. It is faster than uncached, but it is also more expensive, more brittle, and slower than a 2,000-token system prompt that does the same job.

Where to start

Pick the single highest-traffic LLM call in your system. Print the prompt. Identify the largest contiguous prefix that does not change per call. Re-arrange the call so that prefix is at the front, every dynamic value is at the back, and any timestamp is not in the prefix. On Anthropic, add the cache_control marker; on OpenAI, the rearrangement alone enables caching automatically.

Then measure. Both providers return cache-hit metadata on every response (Anthropic’s usage.cache_read_input_tokens and cache_creation_input_tokens; OpenAI’s usage.prompt_tokens_details.cached_tokens). Log it. Aggregate it daily. The first week’s hit rate is rarely above 60% on a system that has not been designed for caching; the disciplines in this post will move it past 90% within a week or two.
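A sketch of that logging step for both providers, using the usage fields named above; the logger name and the provider switch are illustrative.

```python
import logging

log = logging.getLogger("llm.cache")

def log_cache_usage(provider: str, response) -> None:
    """Pull the cache counters off a provider response and emit one log line."""
    u = response.usage
    if provider == "anthropic":
        read, written = u.cache_read_input_tokens, u.cache_creation_input_tokens
        total = u.input_tokens + read + written   # Anthropic reports uncached input separately
    elif provider == "openai":
        read, written = u.prompt_tokens_details.cached_tokens, 0   # no write counter exposed
        total = u.prompt_tokens                   # OpenAI reports total prompt tokens
    else:
        return
    log.info("cache_usage provider=%s read=%d written=%d prompt_total=%d",
             provider, read, written, total)
```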

That measurement loop is the difference between “we enabled caching” and “we cut LLM cost by 80%.” Both are common; the first one is silently underperforming.
