prompt-caching
7 articles
Prompt Cache Economics — Why Your AI Bill Is Higher Than You Think
Prompt caching should save you 90% on token costs — but one obscure bug can silently make you pay 10x more. From DANGEROUS_uncachedSystemPromptSection to the cch=00000 billing trap hidden in Claude Code's DRM, here's why prompt engineers now need to be accountants too.
Anthropic Prompt Caching Deep Dive — Automatic Caching, 1-Hour TTL, and the Gotchas They Don't Tell You
Anthropic's prompt caching got major updates in 2026: Automatic Caching removes manual breakpoint headaches, 1-hour TTL keeps caches alive longer, and the invalidation hierarchy decides what blows up when you change things. Plus our real-world $13.86 billing disaster story.
Inside Claude Code's Prompt Caching — The Entire System Revolves Around the Cache
Anthropic engineer Thariq shared hard-won lessons about prompt caching in Claude Code: system prompt ordering is everything, you can't add or remove tools mid-conversation, switching models costs more than staying, and compaction must share the parent's prefix. They even set SEV alerts on cache hit rate. If you're building agentic products, this is a masterclass in real-world caching.
The LLM Context Tax: 13 Ways to Stop Burning Money on Wasted Tokens
The 'Context Tax' in AI carries a triple penalty: cost, latency, and reduced intelligence. Nicolas Bustamante's 13 Fintool techniques cut agent token bills by up to 90%. A real-money guide to optimizing AI context, covering KV cache, append-only context, and 200K-token pricing.
Prompt Caching Money-Saving Guide: Your API Bill Can Lose a Zero (Series 1/3)
An AI engineer stuffed user-specific data into the system prompt, watched his bill double, and learned his lesson the hard way. Plus six practical tips for consistently hitting the prompt cache. (Part 1 of 3)
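The core fix the article teases can be sketched in a few lines: keep the system prompt byte-identical across users so the cached prefix always matches, and move per-user data after it. This is a minimal sketch using the shape of Anthropic's Messages API with `cache_control`; the prompt text and model name are illustrative, not from the article.

```python
# Cache-friendly request building: the system block is static and marked
# cacheable, so every user shares the same cached prefix. User-specific
# data goes in the user turn, AFTER the cached prefix.

STATIC_SYSTEM_PROMPT = "You are a helpful support assistant for Acme Corp."  # hypothetical

def build_request(user_name: str, question: str) -> dict:
    """Build a request whose cacheable prefix never varies per user."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        # Static prefix: never interpolate user data here, or every
        # request becomes a cache miss.
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Per-user data lives in the (uncached) suffix.
        "messages": [
            {"role": "user", "content": f"User: {user_name}\n\n{question}"}
        ],
    }

req_a = build_request("alice", "What is our refund policy?")
req_b = build_request("bob", "What is our refund policy?")
# Identical system blocks across users means the prefix can be cached.
assert req_a["system"] == req_b["system"]
```

The anti-pattern from the blurb is the inverse: formatting `user_name` into `STATIC_SYSTEM_PROMPT` itself, which makes every user's prefix unique and every request a full-price cache miss.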
Inside LLM Inference: KV Cache & the Memory Nightmare (Series 2/3)
Part 1 taught you how to save money. Part 2 explains why those tricks work. From the two stages of LLM inference (prefill/decode) to KV cache fundamentals to the GPU memory crisis that makes naive caching fall apart at scale. (Part 2 of 3)
Paged Attention + Prefix Caching: The Ultimate GPU Memory Hack (Series 3/3 Finale)
Operating systems solved memory fragmentation with paging decades ago. vLLM brought that same trick to GPUs, added block hashing and prefix caching, and made prompt caching a reality. Series finale — every puzzle piece clicks into place.