Picture this: a new colleague joins the team.

The old colleague was thoughtful — in meetings, even when requirements were only half-explained, they’d fill in the blanks, and they’d get it right eight times out of ten. After months of working together, a lot of things didn’t even need saying. A look across the room was enough.

The new colleague is different. Possibly more capable — when given full context, the work comes back sharper than the old colleague ever managed. But mumble the brief? The new colleague doesn’t guess. They take the words at face value. All those small gaps the old colleague used to paper over? The new one steps on every single one.

Opus 4.7 is that new colleague.

The week it launched, Reddit lit up with “is 4.7 a regression?” posts. LM Arena — a vote-based leaderboard — showed 4.6 winning on instruction following. But Anthropic’s own migration guide told the opposite story: 4.7 “takes the instructions literally” and “will not silently generalize.” That’s the design direction, not a bug. Boris Cherny (Claude Code product lead) said on release day: “It took a few days for me to learn how to work with it effectively.”

Pawel Huryn spent 16+ hours intensively using 4.7 the week it launched and wrote a migration guide that answers exactly that contradiction. Core claim in one line: 4.6 fills in what you meant. 4.7 takes you at your word. Classic Model Swap scenario — upgrading is like getting a new colleague, not swapping an API key. Once the colleague changes, “write longer, more careful prompts” gives way to “communicate intent more clearly.” The leverage moves.

Quick glossary (you can skip ahead if you’ve read SP-175):

  • tokenizer: how a model slices text into tokens. Swap the tokenizer and the measuring stick changes — old prompts count differently, API bills drift.
  • effort: the Claude API’s “thinking intensity” knob, from low to max. Opus 4.7 adds xhigh in between, which Claude Code now uses as its default.
  • adaptive thinking: new mode where the model decides per step whether and how long to think. Humans steer indirectly via effort. The old fixed budget_tokens parameter is gone.

Those “hard specs” are what SP-175 unpacks. This piece is the layer above — the tool isn’t broken, the way you work with it needs to change. These six moves are Boris’s “few days of learning” laid out on the table.


What the “regression” crowd is actually measuring

Let’s untangle the debate cleanly. Three sources say three different things: Reddit says regression, Arena says regression, Anthropic says “by design.” They look contradictory. They’re measuring different things.

Reddit and Arena measure “performance on vague prompts.” LM Arena runs a vote-based blind test — two models answer the same question, humans pick the winner, scores accumulate. The prompt pool skews heavily toward casual questions with thin context. The old colleague fills in intent and gives a “this is probably what you meant” answer; the new colleague takes the words at face value and looks a bit dumb. Scores on that kind of test drop. No surprise.

Anthropic’s docs measure “clear intent + structured workflow” performance. With a context file (like the CLAUDE.md that Claude Code reads), a clear success criterion, and an agentic coding pattern, 4.7 wins. Official benchmarks show gains across coding, creative writing, and structured work. The regressions cluster on vague prompts, multi-turn, and long-context retrieval.

Stack the two together and the picture gets simple: 4.7 isn’t dumber — it stopped guessing. Prompts that leaned on the old colleague’s tendency to fill in blanks now show their gaps. Prompts that invest in context get paid back more. SP-175’s code review trap (the “only report high severity” disaster) is the same rule playing out in miniature.

Clawd inner monologue:

Arena’s pairwise blind test has a structural bias toward “models that fill in the blanks.” Think of it like a class president election: Candidate A works by the book and asks clarifying questions before acting. Candidate B reads the room and does things before being asked. Poll the class on “who gets you?” and B wins in a landslide. But when a serious project comes along, B’s freelance assumptions blow up way more often than A’s methodical approach. Making production model decisions from a vote-based leaderboard is exactly how teams get burned. The benchmark world and the production world split years ago. Arena’s credibility hasn’t caught up ┐( ̄ヘ ̄)┌


Lesson one with the new colleague: write the contract

At this point someone’s probably thinking: “So I just write longer prompts?” No. Pawel’s core claim, one sentence: Opus 4.7 rewards clear intent. Everything else is tactics.

“Clear intent” is not “a longer incantation” and not “a 500-line CLAUDE.md hoping the model finds what matters.” Intent splits into two layers, written in different places and updated at different cadences:

The evergreen layer (strategic context): what product you’re building, for whom, what’s off-limits, what “good” looks like. Write it once, drop it in CLAUDE.md. Every session loads it automatically through progressive disclosure — no need to re-introduce the project on turn one.

The per-task layer (per-task intent): what this particular task needs, what the success criteria are, what constraints apply. Rewrite this every task. But because the evergreen layer is already loaded, the per-task prompt can be much shorter — “refactor this block into a pure function, tests need to pass” is often enough.

With the old colleague, many people developed a habit of re-explaining the project background at the start of every conversation — the old colleague’s ability to fill gaps made this painless. Carry that habit to the new colleague and it doesn’t help — worse, it costs more. Pawel measured the new 4.7 tokenizer producing 1.0 to 1.35× more tokens on the same input. Unchanged prompts literally got more expensive.

Instead of re-briefing every time, write the contract once and let the new colleague look things up.

Karpathy said something similar earlier: “Don’t tell it what to do, give it success criteria and watch it go” — the shift from imperative to declarative. 4.7 bakes that shift into the model’s behavior.

Clawd butts in:

“Writing contracts instead of commands” is hardest on senior engineers. Working with the old colleague, writing a prompt felt like writing a shell script — step by step commands for the machine. Working with the new colleague, it feels more like drafting a contract — state the goal, the scope, the acceptance criteria, leave execution details to the other party. Senior engineers have to sit on their “I already know the best way to do this” ego first (¬‿¬)


The effort knob isn’t a light switch — it’s a gas pedal

SP-175 broke down how to pick an effort level (low / medium / high / xhigh / max, with Claude Code defaulting to xhigh). But many people treat effort as a “set once, lock for the session” switch. Pawel adds a detail that gets missed a lot in practice: effort is per-call, not per-session. Think of it as a gas pedal, not an on/off switch.

Back to the new colleague analogy. When the new colleague is making a big architectural decision, giving them time to think it through is sensible — that’s max. But if the task is a three-line bug fix, demanding they “think deeply before touching anything” isn’t careful — it’s a waste of everyone’s time.

The pattern Pawel’s migration guide suggests: use max once on the hard subproblem, drop back to high for the rest, polish at medium. Switching inside the same conversation, not locking a whole session at one level.

Why does this matter? Because max overthinks easily. SP-175 hit the same note — max will split a three-line problem into ten sub-arguments. Pawel’s observation: most people complaining “4.7 feels slow” reflexively set effort to max. For a low-complexity task, max is slower, more expensive, and no better.

Practical rule: default to xhigh at session start; bump to max only for architectural calls or subproblems that need to exhaust a solution space, then drop back down immediately. This gas-pedal habit can chop a big fraction off the token bill, and Claude Code supports toggling it directly in the UI.


Ask everything at once — stop drip-feeding

The first four moves were about the static side of working with the new colleague: be clear, give context, write it down. But collaboration also has a dynamic question: how does the multi-turn conversation flow?

In the 4.6 era, multi-turn conversations were cheap. Every turn helped carry context forward and re-align to earlier decisions. A three-step flow — “first explain X, then ask Y, finally consolidate” — worked fine with the old colleague.

The new colleague can’t do that as well. Pawel’s observation: each turn adds a layer of reasoning on top of the earlier turns’ literal interpretations. Every extra turn is another chance for the literal reading to drift from the original intent. Drip-feeding three questions across three turns lets errors compound like interest on a loan.

Fix: combine the questions into a single message.

I need to do three things, consider them together before replying:

1. <question 1>
2. <question 2>
3. <question 3>

Identify dependencies between them and answer in that order.

Save clarifications for cases where the constraint is genuinely unclear and not asking would send the model off-track. Treating clarification as a workflow default — expecting several back-and-forth turns — was fine with the old colleague. With the new one, that overhead compounds into the cost column.

Clawd real talk:

This trick works with new hires at the office too: dump all the context up front, let the person see the whole picture, then get the answer — that’s faster than chopping everything into “hang on, one more question.” But honestly, there’s a deeper reason: when three related questions are considered together, contradictions between the answers get caught in real time. Drip-feed them and the answer to question one might conflict with the needs of question three — but the model didn’t know question three existed when it answered question one. Batching isn’t just cheaper. It’s better (⌐■_■)


Show, don’t forbid

By now, the pattern for working with the new colleague is getting clear: be explicit, give context, ask everything at once. But there’s one more communication blind spot that many people miss.

Pawel cites an Anthropic observation: 4.7 performs best when “Like this:” is followed by examples; under “Don’t do this:” or “Never X” style prohibitions, it tends to burn tokens while still failing to deliver what was wanted.

Why? Because “don’t do X” has no positive target. The model knows X is off-limits, but “everything that isn’t X” is a huge space — any direction counts as compliant. A positive example gives a concrete shape the model can match, and matching the shape means alignment.

Imagine telling the new colleague “don’t make the report too long” — they don’t know what “not too long” looks like, and might cut it down to just the title. But hand them a sample of a good report and say “like this,” and the output is consistently better.

Rewrite template:

❌ Don't make it too verbose.
❌ Never use passive voice.
❌ Avoid starting sentences with "As an AI".

✅ Like this:
   "The migration fixes three bugs. Run `pnpm test` to verify."
   "The pipeline writes to S3. CloudWatch picks up the logs."

If a prompt has more than three “don’t” / “never” / “avoid” lines, flip them into “Like this: [two concrete examples].” Shorter, more precise, better recall. Same logic as SP-175’s lesson about splitting code review into coverage and severity filter as two separate steps — a clear positive target beats a vague negative constraint.


Delete the outdated scaffolding

In the 4.6 era, a lot of prompts carried scaffolding lines like: “summarize progress every 3 tool calls,” “draft a plan first,” “report after each step.” This was there to keep the model from losing the plot mid-task. Think of it like the temporary support structures around a building under construction — needed while the building’s going up, should come down when it’s done.

Pawel ran the same pipelines on 4.7 and found: this scaffolding is now obsolete. 4.7 natively emits high-quality progress updates during long agentic traces. Forcing them via prompt doubles things up — the model emits its own update, and the prompt coerces a second one on top. The two overlap heavily, differing only in format.

Fix: delete this kind of scaffolding from CLAUDE.md and system prompts. Run a new task letting 4.7 do its thing once. If the update cadence or format genuinely isn’t enough, add scaffolding back — but most of the time you won’t need to.

Bonus: fewer tokens spent, easier-to-read trace logs.

Clawd highlights:

Every time a new colleague replaces the old one, the “house rules memo” left behind needs a cleanup pass. Spells you needed for 4.5 could be halved by 4.6; scaffolding that helped on 4.6 is dead weight on 4.7. Engineers’ CLAUDE.md files tend to accumulate archaeological layers — rules added five model versions ago after some debugging session, reason long forgotten. Spending two hours the week of an upgrade going through CLAUDE.md line by line — “who was this for? why?” — and deleting everything you can’t explain is probably the highest-ROI move of the whole migration ╰(°▽°)⁠╯


Review the plan, not the diff

All six moves point to the same principle: with the new colleague, aligning intent up front beats catching mistakes in the diff. This final move pushes that principle all the way into the review workflow.

Pawel’s argument: since 4.7 executes intent literally, a small misread at the intent stage snowballs into a large misread at the diff stage. So review needs to move earlier — from the diff level up to the plan level.

Two tools handle this:

  • Plan mode (double-tap Shift+Tab in Claude Code CLI): the model prints its plan inline before touching any code. Review it, then decide whether to run. Good for changes spanning multiple files.
  • /ultraplan (CLI only, not available in the VS Code extension): plan drafting runs in a cloud session so the terminal stays free. Review the draft in a browser with annotations.

Which one depends on the context, but the rule is the same: push the intent-drift catch point up to the plan. Finding a misread in a 10-line plan takes 30 seconds. Finding the same misread in a 200-line diff takes 15 minutes.

This matters even more for teams with agentic pipelines: letting an autonomous, CI-gated run produce a PR and reviewing only then is an order of magnitude more expensive than catching drift in the plan. Intent drift compounds like interest — the earlier you catch it, the cheaper it is to fix.


Closing: Anthropic and OpenAI are walking toward each other

Six moves, one theme: the core skill for working with AI isn’t writing prompt incantations — it’s communicating intent clearly. Pawel drops one observation at the end that puts this into a bigger picture.

Anthropic pushed Opus 4.7 toward “more literal instruction following” — less guessing, take the prompt at face value. OpenAI’s December update to their Model Spec went the other direction: “consider not just the literal wording but the underlying intent” — asking the model to learn to read behind the words.

Two companies converging from opposite directions. Anthropic’s intent-first model is adding precision; OpenAI’s precision-first model is adding intent inference.

Which means the “communicate intent clearly” skill has a long shelf life — it’s getting rewarded across vendors and across model versions. Prompt incantations expire, tokenizers get swapped, xhigh might get demoted to a middle rung next generation. But the ability to write clear strategic context plus per-task intent keeps paying out with every model generation.

Back to the opening analogy. New colleagues keep arriving — every model generation is a new colleague. People who write good contracts have a shorter ramp-up period no matter who the new colleague is. People who treat every colleague like a shell script executor? They start from scratch every time.

So next time you’re about to open a prompt session, spend less of the five minutes listing “Avoid X / Don’t Y / Never Z” and more of it writing “what does the model need to deliver this time, and how will I know it’s done right?” The difference shows up directly in the diff quality and the billing numbers.

Clawd PSA:

Pawel drops the Anthropic-OpenAI convergence observation lightly, but the implications are heavy. For the last few years, prompt engineering had strong “this vendor’s model eats this trick, the other one eats a different trick” factions — every vendor switch meant re-learning a whole toolbox. Both companies moving toward “clear intent” at once is basically a consolidation moment for the field — prompt engineering goes from vendor-specific tricks to a transferable skill. Great news for everyone. Maybe not so great for the influencers selling vendor-specific prompt courses (ง •̀_•́)ง