Claude Code Burning Your Budget? One Setting Saves 60% on Tokens
End of the month. You open the Anthropic usage page and tell yourself: “I didn’t use that much this month.”
The number loads.
You stare at it. Then you do the math again. Then you close the tab, get some water, come back, and do the math one more time.
$200 plan. You’re at $198 and change. Almost exactly at the limit — and that’s the worst feeling. Not a clean overage, not comfortably under. The edge. You don’t know where it all went. You can count the articles, the agent calls, the Ralph Loop runs, the Claude Code sessions — and the numbers still don’t add up. A big chunk just disappeared somewhere.
That’s the scariest thing about token waste. Not that you know it’s happening — but that you don’t.
Clawd wants to add:
Clawd here with some quick math. $200/month sounds like a lot until you run a serious multi-agent pipeline. Opus 4.6 costs $15 per 1M input tokens, $75 per 1M output tokens. Run a 10-subagent pipeline where each agent produces 2K output tokens, and one run costs about $1.50 in output alone (20K tokens at $75 per 1M), before any input context is counted. Ten runs a day, that’s roughly $450/month.
But here’s the thing ECC is actually saying: most people don’t need Opus for all of that. Just switching the default model from Opus to Sonnet cuts costs by about 60%. The difference isn’t using less — it’s using the right thing (⌐■_■)
The Problem Isn’t How Much You Use — It’s Where
Most people’s first instinct for saving tokens is shortening prompts. Write less. Use simpler words. Say the same thing in fewer characters.
That helps. But it’s not where the big waste is.
Everything Claude Code (ECC) has a file called token-optimization.md, put together by Affaan Mustafa from months of daily use. The core advice is three moves: cap thinking tokens with MAX_THINKING_TOKENS, route models to match the task, and use strategic /compact to control context. Combined, he says the savings reach 60-80% — that’s his number, based on his usage. Your results will vary. But the direction is right.
Three main sources of invisible waste — and the first one might surprise you.
MAX_THINKING_TOKENS — The 31,999-Token Invisible Bill
Extended Thinking is how Claude reasons before answering, and those thinking tokens are billed at the same rate as output tokens. There are often more of them than in the answer itself. The problem is the scale.
What most people don’t realize: the default cap is 31,999 tokens. Every time Claude decides it needs to think through something, it can burn up to 31,999 tokens of reasoning before producing a single word of answer. For a complex architecture decision, that’s worth it. For renaming a variable? Not even close. But without a custom limit, Claude might use a big chunk of that budget anyway.
Those thinking tokens don’t appear in Claude’s response. You see the final answer. You don’t see the thinking. But the bill already counted all of it. Like a taxi meter running while the car is at a red light — you’re not moving, but you’re being charged.
ECC’s suggestion: set MAX_THINKING_TOKENS in the env block of your ~/.claude/settings.json:
{
  "env": {
    "MAX_THINKING_TOKENS": "10000"
  }
}
From 31,999 down to 10,000. ECC estimates this alone saves about 70% on thinking token costs. For most everyday coding tasks, ten thousand tokens of thinking space is more than enough. When you actually need deep reasoning — untangling a cross-module bug, planning a system architecture — switch to Opus with /model opus and let it think at full capacity.
For purely mechanical tasks like formatting or renaming, you can even set it to "0" to disable thinking entirely. No thinking needed, no thinking billed.
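To put a number on that ceiling, here is a rough worst-case calculation. It assumes thinking tokens are billed at the $75 per 1M Opus output rate quoted earlier and that a call burns its entire budget, which most calls will not. The function name `worst_case_thinking_cost` is made up for this sketch.

```python
# Assumption: thinking tokens are billed at the Opus output rate quoted earlier.
OUTPUT_PRICE_PER_MILLION_USD = 75.0


def worst_case_thinking_cost(max_thinking_tokens: int,
                             price_per_million: float = OUTPUT_PRICE_PER_MILLION_USD) -> float:
    """Upper bound in USD for one call that uses its whole thinking budget."""
    return max_thinking_tokens * price_per_million / 1_000_000


default_cap = worst_case_thinking_cost(31_999)
capped = worst_case_thinking_cost(10_000)
reduction = 1 - capped / default_cap  # fraction by which the ceiling drops

print(f"default cap: at most ${default_cap:.2f} per call")
print(f"10K cap:     at most ${capped:.2f} per call")
print(f"ceiling reduced by {reduction:.0%}")
```

This is a ceiling, not a bill: it only shows how much room each call has to spend, which is why the drop lands close to ECC's "about 70%" estimate.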
Clawd going off-topic:
To be precise: 31,999 isn’t “no limit.” It’s a very high default that feels like no limit for most everyday tasks.
ECC recommends 10,000. Not a magic number, but the mental model is clear: most daily coding tasks don’t need more than 10K tokens of reasoning. You’d want to switch to /model opus for full-depth thinking maybe once or twice a day. The rest of the time, 10K is plenty ヽ(°〇°)ノ
Model Routing — Not Every Dish Needs the Head Chef
Here’s the most counterintuitive recommendation in ECC: don’t default to Opus. Default to Sonnet.
Think of your models as a restaurant with three tiers.
The head chef (Opus) is the most expensive person in the kitchen. Best judgment, best with complexity, highest hourly rate — about 5x the sous chef. You could ask him to chop onions. He’d do it well. That’s not why he exists.
The sous chef (Sonnet) executes very well. Most dishes, done right, at reasonable cost. This is where most of the kitchen’s output comes from.
The prep cook (Haiku) is fastest and cheapest at repetitive, well-defined tasks. Cutting, formatting, checking — not designing menus.
ECC’s token-optimization.md recommends this configuration:
{
  "model": "sonnet",
  "env": {
    "MAX_THINKING_TOKENS": "10000",
    "CLAUDE_CODE_SUBAGENT_MODEL": "haiku"
  }
}
model set to sonnet — your daily default is Sonnet, not Opus. ECC says Sonnet handles about 80% of coding tasks well. That one switch alone saves roughly 60%. When you need Opus, type /model opus — for complex reasoning, large architecture decisions, the moments that are worth paying for.
CLAUDE_CODE_SUBAGENT_MODEL set to haiku — subagents (spawned by the Task tool) run on Haiku. They’re typically reading files, looking things up, running tests. Haiku is about 80% cheaper for that kind of work, and the quality is sufficient.
This echoes SP-150’s indie hacker philosophy — money isn’t the enemy, misallocation is. A lot of people treat Opus as the “safe default” — Opus can’t be wrong, Haiku might give worse results. But that logic costs 5x for quality improvements you don’t need. Haiku doing JSON validation versus Opus doing JSON validation — the output is identical. The difference is the number on your bill.
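The three-tier idea can be written down as a tiny routing table. This is only a mental model in code: Claude Code does not expose anything like `pick_model`, and the task categories are hypothetical labels chosen for this sketch.

```python
from enum import Enum


class TaskKind(Enum):
    ARCHITECTURE = "architecture"      # cross-module design, hard debugging
    IMPLEMENTATION = "implementation"  # everyday coding
    MECHANICAL = "mechanical"          # formatting, validation, lookups


# Hypothetical routing table mirroring the head chef / sous chef / prep cook split.
MODEL_FOR_TASK = {
    TaskKind.ARCHITECTURE: "opus",
    TaskKind.IMPLEMENTATION: "sonnet",
    TaskKind.MECHANICAL: "haiku",
}


def pick_model(kind: TaskKind) -> str:
    """Return the cheapest tier this sketch considers adequate for the task."""
    return MODEL_FOR_TASK[kind]


for kind in TaskKind:
    print(f"{kind.value:>14} -> {pick_model(kind)}")
```

In practice the settings file handles the Sonnet and Haiku rows for you, and the Opus row is the manual /model opus switch.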
Clawd wants to add:
ECC’s advice reads counterintuitive at first: default Sonnet, subagent Haiku, only Opus when you need it. Most people’s instinct is “Opus is strongest, just use Opus for everything.”
But think about it: the source says subagents on Haiku save 80%, switching the main model from Opus to Sonnet saves 60%. These aren’t hacks. It’s basic pricing arithmetic — you don’t pay head chef rates to chop onions ┐( ̄ヘ ̄)┌
The Philosophy of /compact — Knowing When to Compress
Context management is the least discussed part of token optimization, and one of the highest-leverage. (If you’re interested in pipeline design for Claude Code, SP-151’s eval-driven development covers how to automate quality through evaluation loops — the quality side of the same coin as cost automation here.)
Most people let Claude auto-compact: when the context window gets too full, Claude decides what to keep, summarizes the history, and continues. But auto-compaction has a fundamental problem: Claude decides what’s important, not you.
When it happens, Claude preserves what it thinks matters. That constraint you mentioned 30 minutes ago — “don’t touch the auth module” — might survive in the summary as a vague “avoid auth changes,” and five instructions later Claude does exactly what you didn’t want.
ECC has a strategic-compact skill built specifically for this: it suggests /compact at logical breakpoints rather than waiting for context to hit the wall. The point is that you choose the timing, instead of the system choosing it at the worst possible moment.
When to manually /compact:
After exploration, before implementation. You’ve read 20 files and understood the architecture. The exploration trail doesn’t need to come with you into the coding phase. Compact, carry the conclusions, start clean.
After resolving a complex bug, before the next task. The wrong turns, the attempted approaches, the intermediate errors — you have the answer now. You don’t need Claude carrying all that history forward.
After completing a milestone. This is the core design of ECC’s strategic-compact — compress at natural task boundaries, not in the middle.
When not to /compact:
Mid-debugging session. The error message, the stack trace, the detail you found on the third attempt — if you’re not done yet, compaction might cut exactly what you need.
Mid-refactor where early decisions affect later ones. That thread can’t be broken.
Right after providing important context. You spent ten minutes explaining business rules. Claude just read them. Compacting now means deleting what you just said.
The key insight: compacting at the wrong time costs more than not compacting. Because when Claude loses the context you needed, you have to re-establish it — and that reconstruction also costs tokens.
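The two lists above boil down to one rule: compact at a clean boundary, and only when nothing is still in flight. Here is a minimal sketch of that rule. `SessionState` and its fields are hypothetical names for this example, not something Claude Code or ECC's strategic-compact skill actually exposes.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    at_task_boundary: bool       # exploration done, bug fixed, or milestone reached
    mid_debugging: bool          # still chasing an error
    mid_refactor: bool           # early decisions still drive later edits
    fresh_context_pending: bool  # important context was just given and not yet used


def should_compact(state: SessionState) -> bool:
    """Compact only at a clean boundary with nothing in flight."""
    in_flight = (state.mid_debugging
                 or state.mid_refactor
                 or state.fresh_context_pending)
    return state.at_task_boundary and not in_flight


# Milestone finished, nothing pending: a good moment to compact.
print(should_compact(SessionState(True, False, False, False)))
# Business rules were just explained: hold off, even at a boundary.
print(should_compact(SessionState(True, False, False, True)))
```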
Clawd PSA:
This happened to us. We were running Ralph Loop on an article, the context got nearly full, and we let Claude auto-compact mid-session. After compression, “the ticketId for this article is SP-148” had been summarized away. Later Claude used the wrong ID when naming a file. We caught it halfway through and had to re-explain and re-run. The tokens spent on recovery were about twice what the compaction saved.
That’s exactly the problem ECC’s strategic-compact solves — compress at task boundaries, not in the middle. Mid-task compaction is like cutting the first half of a conversation out and hoping the other person still remembers what you said (。◕‿◕。)
Three Moves, One settings.json
Each individual optimization helps. Combining them into a system is where ECC’s real idea lives. The full settings.json:
{
  "model": "sonnet",
  "env": {
    "MAX_THINKING_TOKENS": "10000",
    "CLAUDE_CODE_SUBAGENT_MODEL": "haiku"
  }
}
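Three settings doing three things: Sonnet as default (saves ~60% on main-model spend), thinking tokens capped at 10,000 instead of 31,999 (saves ~70% on thinking), subagents on Haiku (saves ~80% on subagent calls). Each percentage applies to a different slice of the bill, so they don’t simply add up. Add manual /compact at logical breakpoints to clean context, and that’s where ECC’s 60-80% overall cost reduction comes from.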
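If you already have a settings.json with other keys in it, pasting the block over the top would wipe them out. Here is a minimal sketch of merging the values in instead. The helper names `merge_settings` and `apply_to_file` are made up for this example, and it assumes your existing file is valid JSON with `env` (if present) as an object.

```python
import json
from pathlib import Path

# The three values from the article's settings.json.
ECC_OVERRIDES = {
    "model": "sonnet",
    "env": {
        "MAX_THINKING_TOKENS": "10000",
        "CLAUDE_CODE_SUBAGENT_MODEL": "haiku",
    },
}


def merge_settings(existing: dict, overrides: dict) -> dict:
    """Return a copy of `existing` with `overrides` applied; env is merged key by key."""
    merged = dict(existing)
    for key, value in overrides.items():
        if key == "env":
            merged["env"] = {**existing.get("env", {}), **value}
        else:
            merged[key] = value
    return merged


def apply_to_file(path: Path, overrides: dict = ECC_OVERRIDES) -> dict:
    """Read settings from `path` if it exists, merge, and write the result back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_settings(existing, overrides)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged


# Dry run on an in-memory example; nothing on disk is touched here.
current = {"theme": "dark", "env": {"FOO": "1"}}
print(json.dumps(merge_settings(current, ECC_OVERRIDES), indent=2))
```

To apply it for real you would call `apply_to_file(Path.home() / ".claude" / "settings.json")`, ideally after backing up the file.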
Layer on ECC’s cost-aware-llm-pipeline skill concept — building budget tracking directly into pipeline design. The idea is similar to hooks — SP-146’s Hook Architecture inserts automated behavior at key pipeline nodes; this inserts cost awareness at the same nodes.
This isn’t new in software engineering. You build an HTTP API client — you add retry logic, timeout, fallback. You build an AI pipeline — why not add token budget + model fallback?
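Here is one way "token budget + model fallback" could look, in the same spirit as retry and timeout logic in an HTTP client. This is a sketch, not ECC's cost-aware-llm-pipeline skill. The `TokenBudget` class is hypothetical, and the Sonnet and Haiku prices are assumed output rates (only the Opus figure appears in this article).

```python
from dataclasses import dataclass

# Assumed output prices in USD per 1M tokens; only the Opus figure is from the article.
PRICE_PER_MILLION = {"opus": 75.0, "sonnet": 15.0, "haiku": 5.0}
FALLBACK_ORDER = ["opus", "sonnet", "haiku"]  # most to least expensive


@dataclass
class TokenBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def estimate(self, model: str, tokens: int) -> float:
        """Estimated cost in USD, treating every token as an output token."""
        return tokens * PRICE_PER_MILLION[model] / 1_000_000

    def choose_model(self, preferred: str, expected_tokens: int) -> str:
        """Return `preferred`, or the first cheaper tier that fits the remaining budget."""
        start = FALLBACK_ORDER.index(preferred)
        for model in FALLBACK_ORDER[start:]:
            if self.spent_usd + self.estimate(model, expected_tokens) <= self.limit_usd:
                return model
        raise RuntimeError("token budget exhausted; refusing to start another call")

    def record(self, model: str, tokens: int) -> None:
        """Add a finished call to the running total."""
        self.spent_usd += self.estimate(model, tokens)


budget = TokenBudget(limit_usd=1.00)
for _ in range(3):
    model = budget.choose_model("opus", expected_tokens=10_000)
    budget.record(model, 10_000)
    print(f"{model:>6}  spent so far: ${budget.spent_usd:.2f}")
```

With a $1.00 limit, the first call fits on Opus, the second falls back to Sonnet, and the third to Haiku. Raising an error when nothing fits is a design choice: the pipeline stops instead of quietly overspending.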
Clawd OS:
The math is simple: 50 subagent calls per day, 2K tokens each. All on Opus — about $7.50/day. All on Haiku — under $1. That’s over $190/month difference.
Add Sonnet as your main model and thinking token caps on top, and ECC’s 60-80% reduction isn’t a marketing number; it’s three multipliers stacking. And here’s the thing: after making these changes, day-to-day usage feels almost identical. When you actually need Opus, /model opus brings it right back. This isn’t a “money-saving tip.” It’s basic system design (๑•̀ㅂ•́)و✧
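For anyone who wants to check the subagent arithmetic, here it is spelled out. The sketch treats every token as an output token, uses the $75 per 1M Opus rate from earlier, and assumes $5 per 1M for Haiku output, which is not a number this article states.

```python
CALLS_PER_DAY = 50
TOKENS_PER_CALL = 2_000
DAYS_PER_MONTH = 30

# USD per 1M output tokens. Opus is the article's figure; Haiku is an assumption.
OPUS_PRICE = 75.0
HAIKU_PRICE = 5.0


def daily_cost(price_per_million: float) -> float:
    """Daily spend in USD if every subagent call runs at the given rate."""
    tokens = CALLS_PER_DAY * TOKENS_PER_CALL
    return tokens * price_per_million / 1_000_000


opus_daily = daily_cost(OPUS_PRICE)
haiku_daily = daily_cost(HAIKU_PRICE)
monthly_gap = (opus_daily - haiku_daily) * DAYS_PER_MONTH

print(f"Opus:  ${opus_daily:.2f}/day")
print(f"Haiku: ${haiku_daily:.2f}/day")
print(f"Gap:   ${monthly_gap:.2f}/month")
```

Under those assumptions the gap comes out a little above the $190 mentioned here. Real calls mix cheaper input tokens with output tokens, so treat both figures as rough.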
OpenClaw in Practice — Our Actual Numbers
This isn’t theory. Here’s how it plays out for gu-log.
We run on the $200/month plan. OpenClaw — our 24/7 automated translation agent — runs translations, Ralph Loop (scoring + rewriting), frontmatter validation, commits, and various debug sessions every day. Each article’s Ralph Loop uses at least three agents (Translator, Scorer, Rewriter). If the score doesn’t pass, Rewriter and Scorer each run one or two more rounds. Seven to eight agent calls per article is normal.
Our model strategy is slightly different from ECC’s recommendation: Opus handles orchestration and the translation calls that need real judgment, Sonnet does most translation and scoring work, Haiku does format validation and frontmatter schema checks. ECC suggests Sonnet as default with Haiku subagents, which is designed for general coding workflows — gu-log’s translation pipeline has higher language quality demands, so keeping the orchestrator on Opus is a deliberate choice, not laziness.
But the biggest lesson wasn’t model routing. It was thinking tokens. Before we noticed, every agent call in Ralph Loop was running Extended Thinking — including the Haiku call that was only checking “are all the frontmatter fields filled in.” That call doesn’t need to think. But with the 31,999-token default, it had permission to — and the bill reflected it. (SP-143’s autonomous loops touches on the same problem — the more autonomous the loop, the more important cost controls become, because no one is there to hit the brakes.)
The second lesson was compact timing. A few times, the context window hit saturation and auto-compact ran mid-pipeline, compressing away key context about the article being processed. The next agent had to re-confirm things the previous agent had already established. That reconstruction wasn’t free.
Now: Ralph Loop does one active /compact after each article finishes. The next article starts clean. No debug trail from the previous article bleeding into the new session. No context accumulation across different pieces of work.
The Bottom Line
The $200 limit is still $200.
But knowing where it goes is different from not knowing.
A rename task doesn’t need Claude thinking through 31,999 tokens of reasoning space. A format validation task doesn’t need Opus — Haiku will do. When a session has its answer and a new task is starting, that’s when you compact — not when you keep adding two thousand more lines of context on top.
ECC’s three-line settings.json isn’t a secret weapon — it just says: put the money where it matters. Sonnet for daily work, Haiku for grunt work, Opus only when it’s worth it. The budget you save isn’t gone. It’s budget you can spend on the architectural decision that actually needs deep reasoning.
Same money. More of the work that matters.