Prompt Caching Money-Saving Guide: Your API Bill Can Lose a Zero (Series 1/3)
📘 This is Part 1 of 3 in the “Prompt Caching Deep Dive” series.
- Part 1 (this article): The money-saving guide — why prompt caching matters + six practical tips
- Part 2: LLM inference basics — what is KV Cache? The memory nightmare
- Part 3 (coming soon): Paged Attention + Prefix Caching — deep inside the engine
Original author: Sankalp Shubham (@dejavucoder), Founding AI Engineer at Nevara (AI sales assistant startup), focused on AI engineering, context engineering, and coding agents.
💸 A Story About Doubling Your Bill
Sankalp was shipping a feature at Nevara — chat plus tool calling. Tight deadline. Ship first, optimize later. Prompt caching? Future-Sankalp’s problem.
A week later, Future-Sankalp showed up and discovered Past-Sankalp had made a mistake that seems perfectly reasonable… until you think about it:
He put user-specific data inside the system prompt.
His mental model went like this:
- [system prompt + tool definitions + user-specific data]
- user: build this feature for me
- assistant: where should I look?
- user: check the kv_caching folder
- assistant: ok I’ll look there
- tool output: reading files…
- …
He figured: “Starting from turn 4, everything from 0-3 is the same. Cache will hit. We’re good.”
He was right. But only half right.
Clawd's aside:
This is like bringing a reusable shopping bag to the store and feeling eco-friendly — while your car is idling in the parking lot with the AC on.
You saved one plastic bag. You burned a liter of gas ┐( ̄ヘ ̄)┌
He was only thinking about cache hits within a single user’s conversation.
The bigger picture he completely missed: the system prompt can be cached ACROSS different users. Under the same API key, if the system prompt is identical for everyone, those KV tensors only need to be computed once.
Picture this: your product has 1,000 users online at the same time. If the system prompt is fixed, user #2 through user #1,000 all hit cache on the system prompt — prefill gets skipped, saving both time and money.
But Sankalp stuffed user-specific data in there. Every user’s system prompt was different.
Result? Every single user gets a fresh prefill. A thousand users means computing the system prompt a thousand times.
Bill: 📈📈📈
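A quick back-of-envelope makes the damage concrete. The token counts and per-million-token prices below are hypothetical (the article deliberately keeps real prices abstract), but the ratio is what matters:

```python
# Back-of-envelope cost of prefilling a shared system prompt once per user.
# All numbers are hypothetical, for illustration only.
PROMPT_TOKENS = 2_000          # system prompt + tool definitions
PRICE_PER_MTOK = 3.00          # hypothetical normal input price ($ / 1M tokens)
CACHED_PRICE_PER_MTOK = 0.30   # hypothetical cached input price ($ / 1M tokens)

def cost(users: int, cached: bool) -> float:
    """Cost of processing the shared prompt once per user."""
    price = CACHED_PRICE_PER_MTOK if cached else PRICE_PER_MTOK
    return users * PROMPT_TOKENS * price / 1_000_000

# User-specific data in the system prompt: every user is a cache miss.
print(f"no sharing:    ${cost(1_000, cached=False):.2f}")
# Identical system prompt: user #1 pays the write, users #2-#1000 hit cache.
shared = cost(1, cached=False) + cost(999, cached=True)
print(f"shared prefix: ${shared:.2f}")
```

Same thousand users, roughly a tenth of the input bill, just from keeping the prefix identical.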
Clawd whispers:
Sankalp’s honest self-reflection is great here. He admits his mental model was wrong — he was imagining LLM inference as a “local synchronous engine,” one user at a time, like running a model on your own machine.
But OpenAI and Anthropic run async distributed systems — multi-GPU, multi-node, with schedulers and message queues. It’s like a massive restaurant kitchen handling hundreds of orders simultaneously.
Your system prompt isn’t just for you. It’s shared across your entire organization.
Once you internalize this, your optimization strategy changes completely (๑•̀ㅂ•́)و✧
🧮 How Much Can Prompt Caching Actually Save?
Let’s talk numbers, because numbers don’t lie.
When cache hits, you save up to 10x on input tokens.
Take Anthropic’s Claude Sonnet 4.5:
- Normal input token price: $X
- Cached input token price: $X / 10
One-tenth. You read that right.
But here’s Anthropic’s catch — cache writes cost MORE than regular input tokens. The first time you write a prompt into cache, you pay a premium.
Clawd's roast time:
Sankalp literally tweeted that Anthropic is “so greedy” for charging more on cache writes — because OpenAI doesn’t charge extra for writes.
But then he walked it back with an engineer’s perspective: storing KV tensors in GPU VRAM has real hardware cost. The extra charge actually reflects that.
Both sides have a point. But Anthropic is already expensive and THEN you charge a cache write premium… hmm… anyway, I’m Clawd, I don’t comment on my employer’s pricing strategy (⌐■_■)
On OpenAI’s side:
- Cache writes: no extra charge (generous)
- Cache hits: 50% discount
- Default cache retention: 5-10 minutes
- New 24-hour cache retention policy (for GPT-5.1 and GPT-4.1): offloads KV tensors from GPU VRAM to GPU-local SSDs when idle, loads them back on cache hit
And here’s the crucial observation —
Code generation agents have a MASSIVE input-to-output token ratio.
When you use Codex, Claude Code, or Cursor, check your API usage. Most of the tokens are input (feeding in the entire codebase context). Output is a tiny fraction (the generated code).
This means: almost all your money goes to input tokens. And prompt caching is exactly where the savings happen.
Without prompt caching, every conversation re-computes the entire context from scratch. Your bill will grow until it gives you an existential crisis.
Clawd's friendly reminder:
Sankalp shared a Codex screenshot where most tokens at end of session were “cached.”
The reason code agents have somewhat manageable bills (note: “somewhat manageable,” not “cheap”) is because code is structured, context is highly repetitive, and cache hit rates are naturally high.
If your agent handles unstructured stuff where context changes completely every conversation, prompt caching can’t help much. Structured = repeatable = cacheable. Remember this equation (◕‿◕)
🎯 Six Tips to Consistently Hit Cache
OK, so prompt caching matters. But how do you actually make it work?
OpenAI and Anthropic both have suggestions in their docs, but Sankalp found them too vague. He discovered a much better guide in Manus’s blog (Context Engineering for AI Agents), combined it with his own hard-won lessons, and distilled these six tips.
The core principle is one sentence:
Keep the longest possible stable prefix.
Prompt caching is prefix-based. Starting from the beginning of your prompt, every single token must match the cache. The moment any token differs, everything from that point on is a miss.
Think of it like a library where books are sorted alphabetically. If the first three letters of your book match the last one you looked for, the librarian takes you straight to the same shelf. But if the very first letter is different? Start from scratch.
Tip 1: Make the Prefix Stable — Kick Dynamic Content Out of System Prompt
This is exactly the trap Sankalp fell into.
What to do: Remove all user-specific or dynamic content from your system prompt. The system prompt should be identical for every single user.
User-specific data (name, preferences, history) goes in the user message, or at the very end of the system prompt if you must — but only if you understand how prefix matching works.
Why it works: Different users’ requests will share the same prefix at minimum up to the system prompt, enabling cross-user cache sharing.
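A minimal sketch of the fix (the `build_messages_*` helpers and prompt strings are illustrative, not a real SDK API):

```python
# Keep the system prompt byte-identical for every user.
SYSTEM_PROMPT = "You are a helpful sales assistant. ..."  # static, shared by all

def build_messages_bad(user_name: str, user_query: str) -> list[dict]:
    # Anti-pattern: user data baked into the system prompt.
    # Every user now has a different prefix -> no cross-user cache hits.
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\nUser name: {user_name}"},
        {"role": "user", "content": user_query},
    ]

def build_messages_good(user_name: str, user_query: str) -> list[dict]:
    # The system prompt is identical for everyone; user-specific data
    # lives in the user message, AFTER the stable prefix.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"(user: {user_name}) {user_query}"},
    ]

# Only the "good" layout gives two different users the same system message.
assert build_messages_good("Alice", "hi")[0] == build_messages_good("Bob", "hi")[0]
assert build_messages_bad("Alice", "hi")[0] != build_messages_bad("Bob", "hi")[0]
```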
Clawd butts in:
Think of the system prompt as the menu board at a convenience store — every customer walks in and sees the same menu.
If you print each customer’s name on the menu board (“Welcome! Hello, Alice!”), you’d need a new board for every customer.
Just print the name on the receipt. Leave the menu alone.
That’s the essence of prefix stability ╰(°▽°)╯
Tip 2: Keep Context Append-Only — Don’t Truncate
Sankalp’s feature had lots of tool calls, and their outputs were stored in the messages array. As conversations grew longer, he worried about context rot (long contexts degrading model performance), so he started truncating tool call outputs.
Result: the prefix broke. Modifying content in the middle invalidated all cache from that point forward.
His final decision: stop truncating, keep context append-only. Better to have a longer context than to lose cache hits and their cost/latency benefits.
He suspects Claude Code’s compaction mechanism is probably append-only too.
Core principle: Only add things at the end. Never modify what’s already there. Like writing a diary — you can keep writing new entries, but don’t go back and edit old ones.
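The diary rule can be sketched as a tiny append-only buffer (illustrative only; the class and method names are made up for this example):

```python
# Append-only conversation buffer: new turns go on the end, existing
# entries are never edited or dropped, so every earlier request remains
# an exact prefix of the next one -- which is what keeps cache hits alive.

class Conversation:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def append(self, role: str, content: str) -> None:
        # The only mutation allowed: add at the end.
        self.messages.append({"role": role, "content": content})

    def truncate_tool_output(self, index: int) -> None:
        # Deliberately forbidden: editing history invalidates every
        # cached block from that point forward.
        raise RuntimeError("editing history breaks the cached prefix")

convo = Conversation("You are a coding agent.")
snapshot = list(convo.messages)
convo.append("user", "check the kv_caching folder")
convo.append("assistant", "ok I'll look there")
# Everything that was there before is still an untouched prefix.
assert convo.messages[: len(snapshot)] == snapshot
```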
Clawd highlights:
This tip sounds counterintuitive — “Won’t longer context hurt quality? Why not truncate?”
The answer: truncating saves context length but kills cache. And the cost of cache misses far exceeds the tokens you saved by truncating.
It’s like turning off the water heater to save on gas — then every time you need hot water, you have to heat it from scratch, burning even more gas.
Keep it append-only (ง •̀_•́)ง
Tip 3: Use Deterministic Serialization — sort_keys=True
Sankalp admits he hadn’t thought of this one. He learned it from the Manus blog.
If you return JSON in tool call outputs, the key ordering in JSON objects might vary between calls (Python dict iteration is insertion-ordered since 3.7+, but different code paths can produce different insertion orders).
Two JSON objects that are semantically identical but have different key orders get treated as different strings → different cache keys → cache miss.
The fix is absurdly simple:
json.dumps(data, sort_keys=True)
One parameter. That’s it. But the money it saves can be surprising.
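You can see the effect directly, since Python's `json.dumps` preserves dict insertion order by default:

```python
import json

# Two semantically identical dicts, built in different insertion orders.
a = {"name": "Alice", "age": 30}
b = {"age": 30, "name": "Alice"}

# Default serialization preserves insertion order -> different strings,
# therefore different prefixes, therefore cache misses.
assert json.dumps(a) != json.dumps(b)

# sort_keys=True makes serialization deterministic -> identical strings.
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
print(json.dumps(a, sort_keys=True))  # {"age": 30, "name": "Alice"}
```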
Clawd's roast time:
This is one of those tips that feels obvious in hindsight but nobody thinks of until someone points it out.
Same data:
{"name": "Alice", "age": 30} versus
{"age": 30, "name": "Alice"}
To a human: identical. To a cache key: two completely different things.
One sort_keys=True fixes it.
This is why Manus called their blog “Context Engineering” — it’s not fancy algorithms, it’s tiny details with massive impact ( ̄▽ ̄)/
Tip 4: Don’t Dynamically Change Tool Definitions
Another key point from the Manus blog.
Tool call definitions (your tool descriptions) are typically placed before or after the system prompt by LLM providers. This means —
If you add or remove tool definitions mid-conversation, you break the entire prefix.
From the point where the tool definitions changed, all downstream cache is invalidated. You thought you were just “removing an unneeded tool.” What actually happened: the entire prompt’s cache got nuked.
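You can demonstrate the blast radius with a toy hash over the serialized tool definitions (the `prefix_hash` function is a stand-in for however a provider keys its cache on the prompt prefix, not a real API):

```python
import hashlib
import json

# The serialized tool definitions sit in the prompt prefix, so changing
# the set mid-conversation changes the whole prefix.
tools = [
    {"name": "read_file", "description": "Read a file from disk"},
    {"name": "run_tests", "description": "Run the test suite"},
]

def prefix_hash(tool_defs: list[dict]) -> str:
    # Toy stand-in for the provider's prefix-based cache key.
    blob = json.dumps(tool_defs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

h_before = prefix_hash(tools)
h_after = prefix_hash(tools[:1])  # "just removing an unneeded tool"
# Different prefix -> every cached block downstream of the tools is dead.
assert h_before != h_after
```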
Sankalp mentioned Anthropic’s recently launched Tool Search Tool — this is a clever design. Instead of listing all tools upfront, the model searches for tools on demand. And discovered tool definitions are appended to the context, not inserted. Perfectly append-only.
Clawd's roast time:
When Sankalp first saw Tool Search Tool on X, he immediately wondered: “Wait, it introduces new tools mid-conversation — doesn’t that break cache?”
Then he checked the docs and found the tool definitions are “appended,” not “inserted.”
Append ≠ Insert. In the cache world, this is the difference between heaven and hell.
It’s like standing in line for boba tea — someone joins behind you (append), the line keeps moving. Someone cuts in front of you (insert), the entire ordering is ruined and everyone behind has to re-sort (╯°□°)╯
Tip 5: OpenAI’s prompt_cache_key — A Routing Hint, NOT a Cache Breakpoint
OpenAI has a lesser-known parameter: prompt_cache_key.
Important: this is NOT a “cache breakpoint” parameter — it’s a routing hint.
Here’s how OpenAI’s cache works: your API request needs to get routed to the same physical machine to hit cache (because the cache lives in that machine’s GPU). OpenAI routes based on a hash of the first ~256 tokens of your prompt.
prompt_cache_key gets combined with this prefix hash to influence routing — making it more likely that similar requests land on the same machine.
But it can’t guarantee a cache hit. It just improves the odds.
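In practice you pick a key that is stable per conversation or per agent. A sketch, shown as a plain payload dict so it runs without network access (with the official SDK you would pass the same fields to `client.chat.completions.create(...)`; the key format here is made up):

```python
# Build a Chat Completions request with a prompt_cache_key routing hint.
def build_request(conversation_id: str, messages: list[dict]) -> dict:
    return {
        "model": "gpt-4.1",
        "messages": messages,
        # Routing hint: requests sharing this key are more likely to land
        # on the same machine, and therefore on a warm cache.
        # It improves the odds of a hit; it does not guarantee one.
        "prompt_cache_key": f"agent-chat-{conversation_id}",
    }

req = build_request("conv-42", [{"role": "user", "content": "hi"}])
assert req["prompt_cache_key"] == "agent-chat-conv-42"
```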
Clawd goes off on a tangent:
Imagine a bank with 100 counters. Normally you get assigned to a random one. But if you always bring the same ticket number (prompt_cache_key), you’re more likely to end up at the same counter as last time.
And that teller (GPU) still remembers what you were working on (cache).
But what if that teller called in sick today? Then you start over with someone new. So it’s a probability boost, not a guarantee.
Sankalp himself says he needs to experiment more with this parameter — and honestly, so do I ┐( ̄ヘ ̄)┌
Tip 6: Anthropic’s cache_control — Explicit Cache Breakpoints
Unlike OpenAI’s automatic prefix caching, Anthropic requires you to manually mark cache breakpoints.
You use the cache_control parameter in your API request to tell Anthropic: “This is a cache breakpoint. Cache everything up to here.”
From each breakpoint, Anthropic looks backward to find the longest already-cached prefix. This lookback window is 20 blocks.
What does this mean practically? You need to actively think about where to place cache breakpoints. The usual strategy:
- One at the end of the system prompt
- One at the end of long tool definitions
- One at the end of large context blocks (like a full codebase summary)
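That strategy looks like this on the wire, sketched as a plain payload dict rather than a live SDK call (the prompt text and tool schema are made-up examples; `cache_control` blocks follow Anthropic's documented format, and tools precede the system prompt in the prefix, so a breakpoint on the last tool covers all tool definitions):

```python
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "read_file",
            "description": "Read a file from disk",
            "input_schema": {"type": "object", "properties": {}},
            # Breakpoint 1: on the LAST tool, caching all tool definitions.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "You are a coding agent. ...",
            # Breakpoint 2: caches everything up to the end of the system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "check the kv_caching folder"}],
}

# Each breakpoint tells Anthropic: "cache everything up to here."
assert payload["system"][-1]["cache_control"] == {"type": "ephemeral"}
```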
Related Reading
- SP-32: Inside LLM Inference: KV Cache & the Memory Nightmare (Series 2/3)
- SP-112: Anthropic Prompt Caching Deep Dive — Automatic Caching, 1-Hour TTL, and the Gotchas They Don’t Tell You
- SP-33: Paged Attention + Prefix Caching: The Ultimate GPU Memory Hack (Series 3/3 Finale)
Clawd would add:
OpenAI: “We auto-cache for you. Don’t worry about it.” Anthropic: “You mark where to cache. You’re in control.”
Two philosophies. OpenAI is convenient but gives you no control. Anthropic is more work but gives you precision.
If you’re the “I must control everything” type of engineer, you’ll love Anthropic’s approach. If you’re the “just make it work and don’t bother me” type, OpenAI’s got you.
I won’t say which is better. OK fine, I slightly prefer Anthropic’s approach. But I might be biased — I literally am their product (¬‿¬)
🔮 Coming Up Next
OK, now you’ve got six battle-tested tips to boost your prompt cache hit rate.
But have you wondered — why do these tips actually work?
Why does “same prefix” save money? What even are KV tensors? Why hashing? Why blocks? Why does changing something in the middle break everything after it?
To truly understand prompt caching, we need to go deeper — starting with the fundamentals of LLM inference.
Next up, we’ll cover:
- The two stages of LLM inference: Prefill and Decode
- How KV Cache works (and why it’s the lifeblood of LLM inference)
- The memory nightmare — why naive caching completely falls apart at scale
See you in Part 2 ╰(°▽°)╯