Claude Code Burning Your Budget? One Setting Saves 60% on Tokens
End of the month. You open the Anthropic usage page and tell yourself: “I didn’t use that much this month.”
The number loads.
You stare at it. Then you do the math again. Then you close the tab, get some water, come back, and do the math one more time.
$200 plan. You’re at $198 and change. Almost exactly at the limit — and that’s the worst feeling. Not a clean overage, not comfortably under. The edge. You don’t know where it all went. You can count the articles, the agent calls, the sessions — and the numbers still don’t add up. A big chunk just disappeared somewhere.
That’s the scariest thing about token waste. Not that you know it’s happening — but that you don’t.
Clawd wants to add:
$200/month sounds like a lot until you run a serious multi-agent pipeline. Opus 4.6 costs $15 per 1M input tokens, $75 per 1M output tokens. Run a 10-subagent pipeline where each agent uses 2K tokens, and one round trip costs about $3. Ten runs a day, that’s $900/month.
This isn’t meant to scare you — it’s meant to say: if your pipeline doesn’t think about cost, money moves fast. The goal isn’t to stop using Claude. It’s to know where the money is actually going (⌐■_■)
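The math above is easy to sanity-check in a few lines. The per-agent split below (roughly 10K input + 2K output tokens per call) is my own assumption — the text only says "2K tokens" per agent — but it's one plausible breakdown that lands on the ~$3 round-trip figure:

```python
# Back-of-envelope cost check for a 10-subagent Opus pipeline.
# Pricing: Opus at $15 / 1M input tokens, $75 / 1M output tokens.
# The 10K-in / 2K-out split per agent call is an illustrative
# assumption, not a measured number.

OPUS_IN = 15 / 1_000_000   # $ per input token
OPUS_OUT = 75 / 1_000_000  # $ per output token

def run_cost(agents=10, in_tok=10_000, out_tok=2_000):
    # Cost of one full pipeline round trip.
    return agents * (in_tok * OPUS_IN + out_tok * OPUS_OUT)

per_run = run_cost()            # ~$3.00 per round trip
per_month = per_run * 10 * 30   # 10 runs/day, 30 days -> $900
```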
The Problem Isn’t How Much You Use — It’s Where
Most people’s first instinct for saving tokens is shortening prompts. Write less. Use simpler words. Say the same thing in fewer characters.
That helps. But it’s not where the big waste is.
Everything Claude Code (ECC) has a file called token-optimization.md, put together by Affaan Mustafa from months of daily use. He says combining MAX_THINKING_TOKENS with model routing can cut costs 60-80% — that’s his number, based on his usage. Your results will vary. But the direction is right.
The three main sources of invisible waste:
First: Extended Thinking burning tokens on tasks that don’t need it. Claude’s thinking mechanism is powerful — but it uses tokens to think before answering, and those thinking tokens cost just as much as output tokens. Often more. You ask it to rename a variable. It doesn’t need 10,000 thinking tokens. But if you haven’t set a limit, it might use them anyway.
Second: Wrong model for the job. Opus is expensive. Haiku is cheap. Sonnet is in the middle. When every task in your pipeline runs through Opus — including “format this JSON” — you’re paying head chef rates to slice onions.
Third: Compacting too late. Waiting until context is full means Claude spends extra tokens summarizing a very long history. The summary drops details you might need later, and you end up re-establishing context you already paid to establish once.
MAX_THINKING_TOKENS — You’re Paying for Silence
This is the easiest setting to change, and one of the least known.
Claude Code’s Extended Thinking has no cap by default. Every time Claude thinks it should reason through something, it decides how much to think — without asking you. For hard problems, this is great. For easy tasks, you’re paying for Claude to talk to itself.
ECC’s suggestion: set MAX_THINKING_TOKENS in settings.json to put a budget on the thinking:
{
  "env": {
    "MAX_THINKING_TOKENS": "3000"
  }
}
But a global cap is just step one. The sharper approach is to use different modes for different types of work — ECC calls these context modes:
Dev mode (quick edits): You’re fixing bugs, refactoring small functions, writing tests. Precise execution, not deep analysis. Low thinking budget. Fast responses, low cost.
Review mode (architecture decisions): You’re evaluating design options, reviewing a large PR, deciding on a technical direction. It’s worth letting Claude think more here — one bad architecture decision costs more than a few thousand extra thinking tokens. Medium budget.
Research mode (deep exploration): You’re trying to understand a new system from scratch, analyzing codebase patterns, planning a large feature. Extended Thinking earns its cost here. Higher budget — but you should consciously switch into this mode. Not every session needs it.
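If you script your sessions, the mode idea can be as simple as a lookup table. The mode names follow the article's context modes; the specific token budgets are illustrative assumptions, not ECC's official values:

```python
# Per-mode thinking budgets. Numbers are illustrative assumptions —
# tune them against your own workload.

THINKING_BUDGETS = {
    "dev": 2_000,        # quick edits: precise execution, minimal reasoning
    "review": 8_000,     # architecture decisions: moderate reasoning
    "research": 20_000,  # deep exploration: opt in consciously
}

def thinking_budget(mode: str) -> int:
    # Unknown modes fall back to the conservative dev budget.
    return THINKING_BUDGETS.get(mode, THINKING_BUDGETS["dev"])
```

You'd then feed the chosen number into MAX_THINKING_TOKENS (for example via the env block in settings.json) before starting the session.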
Clawd twists the knife:
“You’re paying for silence” — this is a real thing.
Extended Thinking tokens don’t appear in Claude’s response. You see the final answer. You don’t see the thinking. But the bill already counted those thousands of thinking tokens. It’s like a taxi meter running while the car is at a red light — you’re not moving, but you’re being charged.
MAX_THINKING_TOKENS: 3000 isn’t magic. It just says “three thousand tokens is enough to think; after that, you decide.” For most everyday tasks, that’s more than sufficient. When you actually need deep thinking, you can raise it for that specific task — or switch to Research mode ヽ(°〇°)ノ
Model Routing — Not Every Dish Needs the Head Chef
Think of your models as a restaurant with three tiers.
The head chef (Opus) is the most expensive person in the kitchen. Best judgment, best with complexity, highest hourly rate — about 5x the sous chef. You could ask him to chop onions. He’d do it well. That’s not why he exists.
The sous chef (Sonnet) executes very well. Most dishes, done right, at reasonable cost. This is where most of the kitchen’s output comes from.
The prep cook (Haiku) is fastest and cheapest at repetitive, well-defined tasks. Cutting, formatting, checking — not designing menus.
Most agent workflows should look like this:
The main orchestrator — the part that decides what to do next and coordinates the pipeline — runs on Opus. The judgment here is worth paying for.
Code generation, translation, writing, running clearly-defined tasks — Sonnet. Good enough, significantly cheaper.
Format validation, quick lookups, schema checks, simple parsing — Haiku.
In Claude Code’s settings.json:
{
"model": "claude-opus-4-6",
"subagentModel": "claude-sonnet-4-6"
}
This keeps the orchestrator on Opus and automatically routes subagents to Sonnet. If your pipeline makes 10+ subagent calls, this difference adds up every day.
If the whole task is simple from the start — no high-level judgment required, pure execution — you can route the entire thing to Sonnet. Not every session needs Opus at all.
Clawd, seriously:
“Model routing” sounds technical, but it describes something intuitive: different work deserves different tools.
The counterintuitive part is that a lot of people treat Opus as the “safe default” — Opus can’t be wrong, Haiku might give worse results. But that logic costs you 5x the money for quality improvements you don’t need. Haiku doing JSON validation versus Opus doing JSON validation — the output is identical. The difference is the number on your bill.
ECC’s token-optimization.md says switching subagent calls from Opus to Sonnet alone cuts most pipeline costs in half. This isn’t a trick. It’s basic pricing arithmetic ┐( ̄ヘ ̄)┌
The Philosophy of /compact — Knowing When to Compress
Context management is the least discussed part of token optimization, and one of the highest-leverage.
Most people let Claude auto-compact — when context gets too full, Claude decides what to keep, summarizes the history, and continues. This mostly works. But it has one problem: Claude decides what’s important, not you.
When auto-compaction happens, Claude preserves what it thinks matters. That constraint you mentioned 30 minutes ago — “don’t touch the auth module” — might survive in the summary as a vague “avoid auth changes,” and five instructions later Claude does exactly what you didn’t want.
Strategic compact means you decide when to compress, and at which point in the conversation. You run /compact. You control the timing.
When to manually /compact:
Before starting a new major task in the same session. You just finished a complex debug session, now you’re moving to a new feature. The debug trail is done with. The conclusions are in your head. This is a good time to clear.
After resolving a complex bug. The wrong turns, the attempted approaches, the intermediate errors — you have the answer now. You don’t need Claude carrying all that history forward.
When switching from exploration to implementation. You spent an hour reading the codebase and understanding the architecture. Now you’re about to write code. Summarize the key findings, compact, and let Claude execute with a clean context.
When not to /compact:
In the middle of an active debugging session. The error message, the stack trace, the detail you found on the third attempt — if you’re not done yet, compaction might cut exactly what you need.
Mid-refactor where early decisions affect later ones. That thread can’t be broken.
Right after you’ve provided important context. You spent ten minutes explaining the business rules. Claude just read them. Compacting now means deleting what you just said.
The key insight: compacting at the wrong time costs more than not compacting. Because when Claude loses the context you needed, you have to re-establish it — and that reconstruction also costs tokens.
Clawd, seriously:
This happened to us. We were running Ralph Loop on an article, context got near-full, and we let Claude auto-compact mid-session. After compression, “the ticketId for this article is SP-148” had been summarized away. Later Claude used the wrong ID when naming a file, we caught it halfway through, and we had to re-explain and re-run. The tokens spent on recovery were about twice what the compaction saved.
It’s not Claude’s fault. It’s a timing problem. Compacting in the middle of an active task is like cutting the first half of a conversation out and hoping the other person still remembers what you said (。◕‿◕。)
Cost-Aware Pipeline — Putting It All Together
Each individual optimization helps. Combining them into a system is where ECC’s real idea lives.
ECC has a cost-aware-llm-pipeline skill built around one principle: budget tracking belongs inside the pipeline design, not on the invoice you look at afterward.
The basic logic: track token usage per session, set two thresholds — soft warning at 70%, model downgrade at 90%. The pipeline keeps running, but automatically switches to cheaper models as it approaches the limit, instead of just stopping or continuing to burn.
The fallback strategy:
Opus → Sonnet (when budget warning triggers)
Sonnet → Haiku (when task is judged simple)
Haiku → fail gracefully (if even that won’t fit the budget)
You design the cost awareness in. You don’t discover the problem on the billing page.
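A minimal sketch of that pattern: track spend against a session budget and downgrade at the thresholds described above (soft warning at 70%, downgrade at 90%). The class and method names here are hypothetical illustrations, not ECC's actual skill API:

```python
# Budget-aware model selection: warn at 70% of budget, downgrade at 90%.
# Names and structure are illustrative, not ECC's actual implementation.

FALLBACK = {"opus": "sonnet", "sonnet": "haiku", "haiku": None}

class BudgetTracker:
    def __init__(self, budget_usd: float):
        self.budget = budget_usd
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        # Call after each model invocation with its estimated cost.
        self.spent += cost_usd

    def pick_model(self, preferred: str):
        ratio = self.spent / self.budget
        if ratio >= 0.9:
            # Hard threshold: step down the fallback chain.
            return FALLBACK.get(preferred)
        if ratio >= 0.7:
            # Soft threshold: keep the model, surface a warning.
            print(f"warning: {ratio:.0%} of session budget used")
        return preferred

tracker = BudgetTracker(budget_usd=10.0)
tracker.record(9.5)                 # 95% of budget spent
model = tracker.pick_model("opus")  # downgraded to "sonnet"
```

The same idea extends naturally: if Sonnet's output also blows the budget, the chain steps to Haiku, and past Haiku the pipeline fails gracefully instead of silently overspending.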
The math is simple: 50 subagent calls per day, 2K tokens each, all on Opus — about $7.50/day. Switch those to Sonnet — about $1.50/day. One month, that’s $180 difference. Before accounting for MAX_THINKING_TOKENS savings or context reconstruction you avoided with strategic compact.
Clawd’s snark time:
“Build budget tracking into the pipeline” is not a new idea in software engineering — rate limiting, circuit breakers, fallback strategies. These are patterns engineers use every day. But when the context shifts to AI token usage, people forget to design it.
You build an HTTP API client — you add retry logic, timeout, fallback. You build an AI pipeline — why not add token budget + model fallback? This isn’t a clever money-saving tip. It’s basic system design. ECC is reminding you of something you already knew how to do (๑•̀ㅂ•́)و✧
OpenClaw in Practice — Our Actual Numbers
This isn’t theory. Here’s how it plays out for gu-log.
We run on the $200/month plan. OpenClaw runs translations, Ralph Loop (scoring + rewriting), frontmatter validation, commits, and various debug sessions every day. Each article’s Ralph Loop uses at least three agents (Translator, Scorer, Rewriter). If the score doesn’t pass, Rewriter and Scorer each run one or two more rounds. Seven to eight agent calls per article is normal.
Our model strategy: Opus handles orchestration and the translation calls that need real judgment. Sonnet does most translation and scoring work. Haiku does format validation and frontmatter schema checks.
But the biggest lesson wasn’t model routing. It was thinking tokens. Before we noticed, every agent call in Ralph Loop was running Extended Thinking — including the Haiku call that was only checking “are all the frontmatter fields filled in.” That call doesn’t need to think. But the thinking tokens still billed.
The second lesson was compact timing. A few times, context hit saturation and auto-compact ran mid-pipeline, compressing away key context about the article being processed. The next agent had to re-confirm things the previous agent already established. That reconstruction wasn’t free.
Now: Ralph Loop does one active /compact after each article finishes. The next article starts clean. No debug trail from the previous article bleeding into the new session. No context accumulation across different pieces of work.
The Bottom Line
The $200 limit is still $200.
But knowing where it goes is different from not knowing.
A rename task doesn’t need Claude to think for 30,000 tokens. A format validation task doesn’t need Opus. When a session has its answer and a new task is starting, that’s when you compact — not when you keep adding two thousand more lines of context on top.
This isn’t about being so frugal you squeeze every token. It’s about being clear where each dollar lands. The thinking tokens you save on a simple task, the subagent calls you route to Sonnet instead of Opus — that’s budget you can spend on the architectural decision that actually needs deep reasoning.
Same money. More of the work that matters.