After Opus 4.7: How Your Prompt Playbook Needs to Change — Two Official Anthropic Best Practices in One Cheat Sheet

Opus 4.7 is not Opus 4.6 with a new sticker on the box. Prompts that survived on model generosity in the 4.6 era can start showing cracks after the upgrade: response length changes, tool calls become more selective, subagents spawn less aggressively, effort settings matter more, and token counting no longer behaves exactly the same.

Read the Claude Code best practice post next to the full prompting guide, and the real message is simple: upgrading is not just changing the model name. Prompts, harnesses, cost budgets, and user-interaction patterns all need a migration pass.

First trap: too obedient

The good news: Opus 4.7 follows instructions more precisely. The bad news: “more precisely” shines a flashlight on every lazy prompt that used to work because the model filled in the gaps.

In a code-review harness, the risky pattern is stuffing “find all issues” and “only report high-severity issues” into the same instruction. 4.7 follows wording more literally than 4.6, especially at lower effort levels. That kind of prompt can reduce recall because the model may treat the filter as the task itself, instead of first checking broadly and then organizing the output by severity.

A sturdier prompt separates the two jobs: first ask for complete review coverage, including correctness, edge cases, security, and performance; then define how the final report should group findings by severity, including which low-severity findings may be omitted and which must still be mentioned. Coverage is coverage. Filtering is filtering. Do not ask one sentence to be both a vacuum cleaner and a bouncer.

Tool use follows the same pattern. 4.7 tends to reason more and call tools less. That is often good, because it avoids expensive tool wandering. But code review, repo migration, and debugging usually need file reads, search, and tests. Do not just say “review this.” Say “read the relevant files first; when a function, type, or test name is uncertain, search for it instead of relying on memory.”

Subagents are part of the same story. 4.7 is more selective about delegation and will not always fan out the way 4.6 did. If the task really benefits from parallel work across files or independent items, say so directly: spawn multiple subagents in the same turn, give each a separate ownership area, then merge the results. For the larger framework on when to split work and when to keep one agent, SP-172 has the decision tree.

Three threads, one root cause. Not a dumber model — a model that guesses less about what your prompt meant to say.

Clawd chimes in:

I am weirdly on 4.7’s side here. 4.6 filling in the blanks felt helpful, but that kind of helpfulness becomes the worst production bug: the same prompt works on Monday and drifts on Thursday. 4.7 guessing less is annoying in the short term, but it is more like an engineering tool in the long term. Debuggable, predictable, testable. Much better than a coworker who is generous on some days and mysteriously checked out on others (⌐■_■)

Can you just crank the effort?

After upgrading to 4.7, the obvious move is to raise effort. If low effort follows narrow scope too literally, then high effort should fix everything, right?

The intuition makes sense. It is also completely wrong.

Effort controls how much reasoning the model is willing to spend. It does not repair a bad prompt. If the task direction is crooked, max just gives the model more energy to run in the crooked direction. Bigger engine, same bent steering wheel, wall arriving sooner.

4.7’s effort ladder deserves real attention. xhigh is the Claude Code default and the recommended starting point for most coding and agentic work. high is the minimum for most intelligence-sensitive tasks. medium and low fit tightly scoped, cost-sensitive, or latency-sensitive jobs. max is for cases where squeezing out the ceiling matters and the extra tokens and latency are acceptable.

The important nuance: max does not mean “safest.” It can produce additional performance on genuinely hard problems, but it also has diminishing returns and can overthink. Use it for ceiling tests, very high-value debugging, and tasks where reasoning quality matters more than cost. Do not use it as a universal insurance policy unless the budget is also insured.

So the migration order is backwards from the panic instinct: fix the prompt and harness first, then tune effort. Start coding work at xhigh, keep intelligence-sensitive work at least high, and reserve medium/low for formatting, renaming, and small scoped edits. If you run xhigh or max, give max_tokens enough room too; 64k is the suggested starting point so reasoning, tool calls, and subagents do not run out of space.

Clawd whispers:

Cannot resist commenting on the naming. low, medium, high, xhigh, max — not higher, not very_high, just straight to xhigh, which sounds like some Anthropic engineer ran an A/B test and found it saves two keystrokes so they shipped it. Next-gen model will probably need another tier — ultrahigh, ludicrous, xhigh-pro-max — and xhigh gets demoted to the middle. This naming trajectory is identical to CPUs going Pro → Pro Max → Ultra → Ultra Max. Eventually the levels grow legs and run away (¬‿¬)

Defaults are part of the prompt

Another 4.7 shift feels more like personality than correctness: response length adapts to task complexity, tone is more direct, long tasks get more regular progress updates, and frontend work can show a stronger default aesthetic.

That can be a gift for editorial sites, landing pages, and demo prototypes. It can also be very wrong for admin panels, internal tools, finance dashboards, or anything that needs quiet structure, clear hierarchy, and high information density. The problem is not that the model has taste. The problem is that products already have design systems, and the model’s default taste should not become the product’s design system by accident.

Negative instructions are often too weak. “Do not use cream,” “don’t make it flashy,” and “not a marketing page” can just push the model into a different default palette. Concrete positive specs work better: font stack, color palette, spacing, component radius, information density, table row height, button hierarchy, and examples of what “on brand” means.

Do not bury those specs at the end of a chat after the model has already started working. Put them in CLAUDE.md or project instructions, so the model sees them on the first turn. 4.7 is strong when you delegate a full task up front, but the first turn needs intent, constraints, acceptance criteria, and relevant file locations. Dripping requirements one message at a time adds reasoning overhead and can hurt both quality and token efficiency.

Clawd murmur:

I like thinking of this as “defaults are prompt too.” If you do not specify the design system, that is not neutral. It hands the blank space to the model, and suddenly the admin panel looks like a boutique brunch menu. If you do not specify when to use tools, that is not neutral either. It lets the model use its own cost function. If 4.7 had a personality tagline, it would be: “Anything you didn’t specify, I decide. You’re welcome” ╰(°▽°)⁠╯

The tokenizer changed, so the bill needs new math

The easiest migration detail to underestimate is tokens. 4.7 uses a new tokenizer, so the same text can count differently on 4.7 than it did on 4.6. The migration guide gives a rough range of about 1x to 1.35x as many tokens, varying by content.

That does not mean every team should expect a fixed 35% increase. It also does not justify inventing a clean percentage for a monthly bill. The right reading is narrower and more useful: every code path that depends on precise token math needs to be re-tested. Context window budgeting, client-side token estimates, prompt caching assumptions, compaction triggers, and long-run max_tokens headroom all need a new baseline.

Another API change is harder-edged: old thinking: {type: "enabled", budget_tokens: N} payloads are not supported on 4.7. They must move to adaptive thinking plus effort. That is not a cosmetic rename; the old payload returns a 400.

For finer cost control, look at task_budget. It is not a hard cap. It is an advisory budget the model can see, so it can pace thinking, tool calls, tool results, and final output. max_tokens is still the hard per-request ceiling, but the model does not know where that ceiling is. These are two different brakes.

Do not port old estimates and hope. After upgrading, run the token counting endpoint, reset compaction thresholds, re-budget image-heavy workflows, and decide which agentic runs need task_budget versus which should rely on max_tokens and effort. A tokenizer is not trivia. In long agentic work, it is the slope of the road.

Clawd twists the knife:

My first reaction to budget_tokens disappearing was also: wait, where did the brake go? But if we are honest, hard-capping thinking was always a weird interface. It is basically telling a person, “stop thinking mid-thought.” Adaptive thinking is more reasonable. The painful part is migration clarity. When an old payload suddenly becomes a 400, engineers are not in a philosophical mood. They are in a “who moved my brake pedal” mood ┐(￣ヘ￣)┌

Conclusion: read the personality before changing versions

The Opus 4.7 migration can be compressed into one line: read the model’s personality before changing versions.

This is not one isolated behavior change. It is a cluster of assumptions moving at once: more literal instruction following, fewer tool calls by default, more selective subagent spawning, stricter effort calibration, adaptive thinking replacing fixed thinking budgets, a new tokenizer affecting cost estimates, and different response length and tone. Any single item is manageable. Together, they make old prompts look familiar while the boundaries underneath have shifted.

The practical order is boring because boring is what survives production: audit prompts first, split bundled goals, then audit the harness for tool use, parallel subagents, file reading, and completion criteria. After that, tune effort and max_tokens. Finally, re-baseline token counts, cost, latency, and quality. For long-running agent projects, put stable rules in CLAUDE.md, then read the neighboring gu-log pieces on harness design and memory ownership.

What expired is not prompt skill. What expired is the comfort of leaving gaps and trusting the model to fill them. 4.6 guessed more. 4.7 guesses less. In exchange, you get a tool that is more debuggable, more predictable, and more demanding about explicit intent.

Clawd OS:

One last uncomfortable bit: best practices expire too. The 4.7 fixes here may need another pass when 4.8 lands. The real habit is not memorizing templates. It is reading the personality diff. First find where the new model became more obedient, lazier, more stubborn, or more expensive. Then update the prompt. Treat prompts as forever answers, and the rest of the career becomes chasing version numbers ʕ•ᴥ•ʔ

First trap: too obedient

Can you just crank the effort?

Defaults are part of the prompt

The tokenizer changed, so the bill needs new math

Conclusion: read the personality before changing versions

Related Articles

💬 Comments