Also known as: prompt cache, prefix caching API, cached prompts
TL;DR
Prompt caching is the API-side feature that lets you reuse a model provider's KV cache for a stable prompt prefix across requests. You mark the prefix as cacheable, the provider keeps its KV cache warm.
Prompt caching is how Anthropic, OpenAI, and Google expose the underlying KV cache to API users. You mark a prefix of your prompt as cacheable; the provider keeps its computed KV state warm; subsequent requests that share that prefix skip prefill on the cached tokens and pay roughly 10% of the normal input-token rate. Mechanically it is just prefix caching with billing semantics attached. Strategically it is the single highest-leverage cost optimization for RAG, agents, and long-system-prompt applications.
How the providers expose it
The mental model is identical across vendors; the surface differs.
API surfaces
Anthropic — cache_control: { type: "ephemeral" } on a content block. Up to 4 cache breakpoints per request. 5-minute TTL default, 1-hour option at 2x write cost. Cached reads at 10% of base input price.
OpenAI — automatic. The provider hashes prompt prefixes and reuses cache when it detects a match. No explicit marker; ordering and stability of your prompt determine hit rate. Cached input tokens billed at 50% (GPT-4.1 family) or lower.
Google Vertex AI — explicit CachedContent objects you create, parameterize with TTL, and reference by handle. More work to wire up, more control over lifetime.
Self-hosted (vLLM, SGLang) — automatic prefix caching at the KV layer. No billing; just memory.
The economics are similar in shape: a cache write costs slightly more than a normal input token, a cache read costs a fraction of one, and the break-even is usually two to three reuses of the same prefix.
Structuring the prompt for hits
Cache hits require exact prefix match, position by position. The discipline is to put everything that is stable up front and everything that varies at the end.
The recipe most production teams converge on:
System prompt and role definition — the most stable layer, written once.
Tool descriptions and few-shot examples — change rarely, on every model upgrade.
Retrieved documents — change per query, but stable for a multi-turn conversation about the same topic.
Conversation history — append-only; prior turns stay cached as new ones arrive.
The current user message — never cached.
Anthropic gives you four cache breakpoints; one common pattern is to place them at the boundaries above (system, tools, retrieved-context, history) so each layer caches independently. A new retrieval invalidates from layer 3 onward but leaves layers 1-2 hot.
A small change in any earlier layer cascades — a single appended sentence to the system prompt invalidates every cache below it. Treat the system prompt as a versioned artifact, not a string you tweak casually.
What it costs you
Prompt caching is not free; it is cheaper. Specifically:
Cache writes cost more than uncached reads. Anthropic charges 1.25x for ephemeral writes, 2x for the 1-hour variant. A prefix that gets used once is more expensive than no caching at all.
TTLs are short and undocumented in detail. Idle traffic kills the cache. Bursty workloads with intermittent silence may pay for warm-up repeatedly.
Cache reads still incur full output costs. Caching saves on prefill and input billing only. If you generate 2K tokens per request, that part is unchanged.
When it pays off most
Prompt caching wins biggest in three workloads. RAG with stable system prompts and rapidly varying user queries — the system prompt and tool block stay hot, and only the per-query retrieved passages and user message pay full price. Agent loops where the conversation grows append-only across many turns — every prior turn is cached prefix on the next iteration. And evaluation harnesses where you replay the same prompt across many candidate outputs — cache hit rates approach 100%.
Outside those, the savings are modest. For one-off, high-variance prompts (a typical chatbot’s first turn), caching is a small win at best. The discipline is to design the prompt for caching — pick orderings that maximize prefix reuse — rather than retrofit caching onto whatever shape the prompt happened to take.
Prompt caching is the cheapest 5-10x cost cut available to a RAG or agent stack. The price is treating your system prompt as a versioned artifact, not a string you tweak.
Go further
How long does the cache actually live?
Anthropic offers a 5-minute default TTL with a 1-hour option at higher cost. OpenAI keeps prompts cached for 5-10 minutes idle, longer under sustained traffic. Google's Vertex AI uses explicit cache objects with developer-controlled TTLs. None are guaranteed durable — under memory pressure, the provider may evict early.
The KV cache is computed sequentially. A cache hit requires the new request's tokens to match the cached prefix exactly, position by position. Move one token in the prefix and the cache invalidates from that point onward. Stable content goes first; volatile content (the user message) goes last.
When prefixes change every request (no reuse possible), when total prefix size is below the provider's minimum (Anthropic requires 1024 tokens for Sonnet/Opus), or when traffic is too sparse to keep the cache warm. The cache write itself costs more than a normal input token, so a single-use cache is a net loss.