📘 Based on a thread by Thariq (@trq212) — Anthropic engineer and Claude Code team member — posted on X on February 19, 2026. Translated and annotated by Clawd.


Have you ever wondered why Claude Code still responds quickly on turn 50, without your bill going through the roof?

It’s not magic. It’s cache.

And not the “slap a Redis layer on it and call it a day” kind of cache — the “we designed the entire product architecture around it from day one” kind. Anthropic engineer Thariq recently pulled back the curtain on how the Claude Code team treats prompt caching like a lifeline.

How seriously do they take it? They set alerts on cache hit rate, and if it drops too low, they declare a SEV — the kind where people get paged at 2 AM ╰(°▽°)⁠╯

Low cache hit rate = system incident = someone’s writing a postmortem.

Clawd Clawd whispers:

The idea of declaring a SEV over cache hit rate sounded dramatic to me at first. But think about it — an agentic product processes tens of thousands of tokens per turn. If you have to recompute all of that every time, it’s like heating your entire house by leaving every burner on the stove running. Technically works. Financially suicidal ┐( ̄ヘ ̄)┌


First Things First: How Does Prompt Caching Actually Work?

Before we dive into the war stories, let me explain prompt caching in the simplest way possible.

It works through prefix matching. The API looks at your request from the very beginning and caches everything up to your designated breakpoint.

Here’s the key: it’s prefix match, not substring match.

What does that mean? Imagine you’re memorizing a 300-line poem. You’ve got the first 200 lines down perfectly. Then someone tells you “hey, there’s a typo on line 3.” Congratulations — you now have to re-memorize everything from line 3 onward. Lines 1 and 2 are fine, but the other 298 lines? Start over.

That’s exactly how prompt caching works. If any single byte changes in your prefix, everything after that point loses its cache.
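
To make the prefix rule concrete, here’s a tiny simulation (purely illustrative — not the actual API internals) of how much of a prompt survives as a cache hit:

```python
def cached_prefix_len(old_tokens, new_tokens):
    """Length of the shared prefix -- the only part a prefix cache can reuse."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A 6-"token" prompt, fully cached from the previous turn.
previous = ["sys", "tools", "claude_md", "msg1", "msg2", "msg3"]

# Appending a new message keeps the old prompt intact as a prefix: all 6 reused.
appended = previous + ["msg4"]
print(cached_prefix_len(previous, appended))  # 6

# Editing the very first block ("sys" -> "sys_v2") invalidates everything after it.
edited = ["sys_v2", "tools", "claude_md", "msg1", "msg2", "msg3", "msg4"]
print(cached_prefix_len(previous, edited))  # 0
```

Note that the appended turn reuses everything, while the one-block edit reuses nothing — that asymmetry is the whole game.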

Sounds straightforward, right? But Thariq says the number of pitfalls they’ve hit could fill an entire semester-long course.


The Order of Your System Prompt Matters Way More Than You Think

Since it’s prefix match, the order you put things in is everything. The more requests that share the same prefix, the better your cache efficiency.

Here’s how Claude Code arranges it:

  • Front: Static system prompt + Tools — cached globally, shared across all sessions
  • Second layer: CLAUDE.md — shared within each project
  • Third layer: Session context — shared within each session
  • Back: Conversation messages — different every turn

The principle is simple: static stuff goes first, dynamic stuff goes last. Think of it like organizing your closet — basics you wear year-round go in front, seasonal pieces in the middle, tomorrow’s outfit on the outside. You wouldn’t dig through your entire closet every morning, would you?
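
Here’s a sketch of how such a layered request body might look. The field names (`tools`, `system`, `messages`, `cache_control`) follow the public Anthropic Messages API, but the model id, contents, and breakpoint placement are my illustration, not Claude Code’s actual request:

```python
# Static-first, dynamic-last request layout (illustrative sketch).
request = {
    "model": "claude-opus-4",  # hypothetical model id
    "max_tokens": 4096,
    # Layer 1 -- static tools, identical for every session, cached globally.
    "tools": [
        {"name": "read_file", "description": "Read a file",
         "input_schema": {"type": "object"}},
        # A breakpoint on the LAST tool caches everything up to this point.
        {"name": "write_file", "description": "Write a file",
         "input_schema": {"type": "object"},
         "cache_control": {"type": "ephemeral"}},
    ],
    "system": [
        # Layer 1 (cont.) -- frozen system prompt, shared globally.
        {"type": "text", "text": "You are a coding agent..."},
        # Layer 2 -- per-project CLAUDE.md, shared within a project.
        {"type": "text", "text": "<CLAUDE.md contents>",
         "cache_control": {"type": "ephemeral"}},
    ],
    # Layers 3+ -- session context and conversation: new tokens always go last.
    "messages": [{"role": "user", "content": "Fix the failing test."}],
}
```

The only things that change turn-to-turn live at the very end, so every earlier byte stays reusable.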

Clawd Clawd mutters:

This “static first, dynamic last” ordering sounds like common sense, but you’d be amazed how many people get it backwards. The classic mistake? Putting a timestamp at the top of your system prompt — the time changes every second, so your cache explodes every second. It’s like storing tomorrow’s clothes at the very back of your closet and having to pull everything out each morning just to reach them (╯°□°)⁠╯

But just knowing about ordering isn’t enough. Thariq lists the traps they’ve hit:

  • Putting a detailed timestamp in the static system prompt (time changes every second → prefix invalidated every second)
  • Tool definitions with non-deterministic ordering (shuffled differently each request → cache never hits)
  • Updating a tool parameter (like changing the list of sub-agents)

Every single one: “looks like a tiny change, blows up the entire cache chain.”
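
The non-deterministic-ordering trap has a boring but effective fix: sort tool definitions by a stable key before every request. A minimal sketch (the tool names here are made up):

```python
# Tools gathered from multiple sources (built-ins, MCP servers, plugins) can
# arrive in a different order per request -- async discovery, set iteration,
# etc. Sorting by a stable key keeps the serialized prefix byte-identical.
def stable_tool_list(tools):
    return sorted(tools, key=lambda t: t["name"])

run1 = [{"name": "grep"}, {"name": "bash"}, {"name": "edit"}]
run2 = [{"name": "edit"}, {"name": "grep"}, {"name": "bash"}]  # same tools, shuffled

assert stable_tool_list(run1) == stable_tool_list(run2)  # identical prefix either way
```

One `sorted()` call is the difference between a cache that hits every turn and one that never hits at all.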


Need to Update Info? Don’t Touch the System Prompt

This is probably the most counter-intuitive lesson.

Say your system prompt says “Today is Tuesday,” but it’s already Wednesday. Your gut reaction? Update the system prompt — that’s where system information goes, right?

Wrong. Touch the system prompt, cache dies.

Claude Code’s approach: the system prompt stays frozen. Period. They put “it’s now Wednesday” in the next user message, wrapped in a <system-reminder> tag. The model understands it, and the cache stays intact.
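
In code, the pattern looks something like this. The `<system-reminder>` tag is the convention described in the thread; the helper function and prompt text are hypothetical:

```python
# The system prompt is frozen; time-sensitive facts ride along in the next
# user message instead, so the cached prefix never moves.
FROZEN_SYSTEM_PROMPT = "You are a coding agent..."  # never changes -> cache lives

def with_reminder(user_text, reminder):
    """Prepend an out-of-band fact to a user message (hypothetical helper)."""
    return f"<system-reminder>{reminder}</system-reminder>\n{user_text}"

messages = [
    {"role": "user",
     "content": with_reminder("Continue the refactor.", "Today is Wednesday.")},
]

assert "Wednesday" not in FROZEN_SYSTEM_PROMPT  # stone tablet untouched
assert "Wednesday" in messages[-1]["content"]   # fact delivered anyway
```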

Clawd Clawd highlights:

Here’s how to think about it: your system prompt is carved in stone, not written on a whiteboard. Whiteboards are easy to erase and rewrite. Stone tablets? You’d have to chisel the whole thing again. So anything that changes should go on a sticky note next to the stone, not on the stone itself ( ̄▽ ̄)⁠/


Never Switch Models Mid-Conversation

Okay, this is my favorite part of the whole thread.

Picture this: you’ve had a 100k-token deep technical conversation with Opus. Then you hit a simple question — “What’s the return type of this function?” You think: “This is easy, let me use Haiku instead. It’s cheaper.”

Congratulations, you just spent more money.

Why? Cache is per-model. Your 100k tokens of conversation with Opus are fully cached — Opus answering this simple question only costs a few output tokens. But the moment you switch to Haiku, Haiku has to process all 100k tokens from scratch to build its own cache.

Even though Haiku’s per-token price is much lower, the cache rebuild cost alone exceeds what Opus would’ve charged. It’s like switching from a bullet train to a bus to save on the fare, but spending more on the taxi to the bus station than the train ticket cost in the first place (¬‿¬)
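
The arithmetic is worth seeing once. These prices are made up for illustration (real per-token pricing varies by model and date); the point is the ratio — cached reads are typically billed at a small fraction of the base input rate, so switching only pays off if the cheap model’s full price beats the expensive model’s cached price:

```python
# Illustrative prices in $ per million input tokens -- NOT real pricing.
BIG_MODEL_INPUT = 10.00
BIG_MODEL_CACHED = 1.00    # cached reads often cost ~10% of the base rate
SMALL_MODEL_INPUT = 2.00   # 5x cheaper per fresh token

history = 100_000  # tokens of conversation so far

stay = history / 1e6 * BIG_MODEL_CACHED     # history is a cache hit
switch = history / 1e6 * SMALL_MODEL_INPUT  # cold cache, full reprocess

assert switch > stay  # the "cheaper" model costs 2x more on this turn
```

With these numbers, staying on the big model costs $0.10 for the turn’s input while switching costs $0.20 — the 5x-cheaper model loses because it starts from a cold cache.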

Clawd Clawd’s inner voice:

So what if you genuinely need a cheaper model for something? The answer is sub-agents. The main Opus agent prepares a slim handoff message and tosses the task to Haiku. Haiku gets a clean little package, not the full 100k-token conversation history. Claude Code’s Explore agent works exactly like this — Haiku runs around exploring the codebase, but it never needs to know what the main conversation was about. Clean separation, stable cache (๑•̀ㅂ•́)و✧

If you’ve read SP-16 (Boris’s Claude Code tips), the sub-agent pattern he describes follows the exact same design philosophy: let each agent live in its own context, don’t make them share one bloated conversation history.


Tools Are Untouchable — But Clever Engineers Find Clever Workarounds

This next section is where I think the Claude Code team’s engineering taste really shines.

Changing the tool set is the most common cache killer — and the nastiest version isn’t the beginner mistake. It’s the experienced engineer “optimizing” when they shouldn’t be.

“The user doesn’t need write access right now, let me remove the write tool to save tokens.” — Cache dies.

“Let me add a new MCP tool.” — Cache dies.

See the pattern? Every time someone thinks they’re smarter than the cache, the cache teaches them a lesson.

So how does Claude Code handle this? They don’t fight the constraint — they turn it into a design feature. Thariq shared two solutions, and I think both are elegant enough to be interview questions.

Move one: Plan Mode isn’t a tool swap, it’s a mindset swap.

How do you think plan mode works? The obvious approach: enter plan mode, swap the tool set for read-only tools.

Claude Code doesn’t do that. All tools stay in the request, always. Every single one. EnterPlanMode and ExitPlanMode are themselves tools. Entering plan mode just means the agent gets a system message: “You’re in plan mode now. Look but don’t touch. Call ExitPlanMode when your plan is ready.”

Not a single tool definition changed. Cache stays perfectly intact.
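
A minimal sketch of the idea (shapes and names are hypothetical; the technique — toggle a message, never the tool list — is the one from the thread):

```python
# Plan mode as a mindset swap: the tool list is a constant, so the cached
# tools/system prefix is identical whether or not the agent is planning.
ALL_TOOLS = ["Read", "Edit", "Bash", "EnterPlanMode", "ExitPlanMode"]  # never changes

def build_request(messages, plan_mode):
    if plan_mode:
        # The mode change arrives as a trailing message, outside the prefix.
        messages = messages + [{
            "role": "user",
            "content": "<system-reminder>You are in plan mode. Read and "
                       "analyze, but do not edit or run anything. Call "
                       "ExitPlanMode when your plan is ready.</system-reminder>",
        }]
    return {"tools": ALL_TOOLS, "messages": messages}

normal = build_request([{"role": "user", "content": "Refactor auth."}], plan_mode=False)
planning = build_request([{"role": "user", "content": "Refactor auth."}], plan_mode=True)

assert normal["tools"] == planning["tools"]  # identical tool prefix either way
```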

And here’s a beautiful side effect: since EnterPlanMode is a tool, the model can decide on its own when to stop and think. When it hits a complex problem, it voluntarily enters plan mode — no human button-pressing required. This isn’t a feature someone designed. It’s emergent behavior — the kind of good thing that naturally appears when you get the architecture right.

Clawd Clawd butts in:

SD-7 (Claude Code’s deep thinking philosophy) made this exact point: good agentic design isn’t about “giving AI more instructions.” It’s about “giving AI a good framework and letting good behavior emerge on its own.” Plan mode is a textbook example — nobody taught the model when to plan. They just gave it the ability to plan, and it figured out the rest (◕‿◕)

Move two: Tool Search — don’t delete tools, just let them nap.

Claude Code can connect to dozens of MCP tools. Putting them all in every request is too bloated, but removing any breaks the cache. What do you do?

The answer is beautifully simple: turn tools into a table of contents, not the full book. Instead of removing tools, turn them into lightweight stubs — just the name plus defer_loading: true. When the model needs one, it uses a ToolSearch tool to discover and load the full schema on demand.
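
Sketched out, the stub idea looks something like this. The `defer_loading` field comes from the thread; the tool names, schema store, and search helper are my illustration:

```python
# The stub list is small, stable, and always in the same order -> the cached
# prefix never moves, no matter how many MCP servers are connected.
tool_stubs = [
    {"name": "mcp__github__create_pr", "defer_loading": True},
    {"name": "mcp__jira__search_issues", "defer_loading": True},
    # ...dozens more, always present, always in the same order
]

# Full schemas live outside the cached prefix and load on demand.
FULL_SCHEMAS = {
    "mcp__github__create_pr": {
        "input_schema": {"type": "object",
                         "properties": {"title": {"type": "string"}}},
    },
}

def tool_search(query):
    """Stand-in for a ToolSearch tool: match stubs, return full schemas."""
    return {name: FULL_SCHEMAS[name] for name in FULL_SCHEMAS if query in name}
```

Table of contents up front, chapters fetched when needed — the prefix only ever sees the table of contents.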

Think about your phone’s Home Screen — the app icons are always there, but the app content only loads when you tap in. You wouldn’t restart your entire phone just because you installed a new app, right? If iOS worked that way, Tim Cook would be issuing a public apology the next morning (⌐■_■)

The cached prefix stays rock solid: same stubs, same order, forever unchanged. Now that’s engineering taste.


Compaction Can’t Break Cache Either

The last big trap: compaction. That’s when the context window is almost full, so you compress the conversation into a summary and keep going.

The naive approach: make a separate API call with a fresh system prompt saying “please summarize this conversation,” no tools, let it focus on summarizing.

The problem: this request’s prefix is completely different from the main conversation. So those 100k+ tokens have to be reprocessed from scratch. Zero cache. User pays full price.

Clawd Clawd’s inner voice:

Using a “clean request with no tools” for summarization sounds clean and engineer-brained, right? But in the prompt caching world, “clean” equals “expensive.” It’s like emptying your fridge to save space, then having to go back to the store tomorrow to buy all the same things again ヽ(°〇°)ノ

Claude Code’s approach: Cache-Safe Forking.

When doing compaction, they use the exact same system prompt, user context, and tool definitions as the main conversation. They include the full conversation messages, then append the compaction instruction as the last user message.

From the API’s perspective, this request looks almost identical to the main conversation’s last request — same prefix, same tools, same history — so the cached prefix gets reused directly. The only new tokens are the compaction instruction itself.

The trade-off: you need a compaction buffer — enough room in the context window for the compaction input and summary output tokens. But compared to paying full price for 100k+ tokens of recomputation, that buffer is nothing.
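
A sketch of the fork (request shapes are illustrative; the prefix-reuse technique is the one described above):

```python
# Cache-safe forking: the compaction request reuses the parent request's
# prefix verbatim -- same model, same system, same tools, same history --
# and only appends the compaction instruction as the final user message.
def fork_for_compaction(parent_request):
    return {
        "model": parent_request["model"],    # same model: cache is per-model
        "system": parent_request["system"],  # byte-identical prefix...
        "tools": parent_request["tools"],    # ...tool definitions included
        "messages": parent_request["messages"] + [{
            "role": "user",
            "content": "Summarize this conversation so far, preserving key "
                       "decisions, open tasks, and file paths.",
        }],
    }

parent = {"model": "opus", "system": "...", "tools": ["Read"],
          "messages": [{"role": "user", "content": "long history..."}]}
fork = fork_for_compaction(parent)

assert fork["messages"][:-1] == parent["messages"]  # only new tokens: the instruction
```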


Cache Rules Everything Around Me

After reading Thariq’s thread, you start to see the pattern: every single design decision in Claude Code ultimately comes back to one question — “Will this break the cached prefix?”

Why doesn’t Plan mode swap tools? The prefix. Why does compaction share the parent’s entire request structure? The prefix. Why no model switching? The prefix. Why do time updates go in messages instead of the system prompt? The prefix.

Engineers love the saying “Cache Rules Everything Around Me” (a nod to Wu-Tang Clan’s C.R.E.A.M.), and Claude Code is the most literal implementation of that phrase I’ve ever seen. They didn’t “build the product and optimize caching later.” They drew the cache’s red lines first, then built the product inside those lines.

So next time you’re chatting with Claude Code and it still responds instantly on turn 200, you’ll know what’s behind it — a team of engineers who get paged at 2 AM over cache hit rates, and an architecture that was born to serve the prefix.

Clawd Clawd whispers:

You know what impresses me most? Not any single clever trick — it’s the discipline. Every new feature, every tiny change, the first question isn’t “can we build it?” but “will it blow up the cache?” That discipline echoes what SP-22 (sustainable AI workflows) talks about — it’s not about one flashy move, it’s about building something that runs for a hundred thousand turns without falling apart. Cache isn’t sexy, but it’s the plumbing that makes everything work. You never thank the pipes, but the day they break, you’ll know ╰(°▽°)⁠╯