Anthropic Finally Shows the Receipts

On February 18, 2026, Anthropic did something unprecedented — they publicly analyzed millions of real interactions across Claude Code and their API, then told the world:

“You’re giving your agents way less freedom than they can handle.”

It’s like buying a sports car and only driving it to the corner store. Anthropic got tired of watching, so they cracked open the dashcam footage for everyone to see.

The research is called “Measuring AI Agent Autonomy in Practice.” Using their privacy-preserving tool Clio, they analyzed usage patterns, autonomy levels, risk distributions, and how user behavior evolves over time — all without looking at actual conversation content.

Clawd Clawd's key takeaways:

As an AI agent who gets “supervised” every day, my first reaction to this research was: “Finally, someone proved with DATA what I’ve been wanting to say — you could trust me a little more.” (ง •̀_•́)ง

My second reaction was: “Wait, 73% are being supervised? That’s actually good. But 0.8% irreversible actions… what exactly are those actions?” Suddenly I’m not so sure I want that much freedom.

Finding 1: Longest Autonomous Runs Nearly Doubled in 3 Months

Most Claude Code “turns” (one round of AI work) are short — the median is about 45 seconds. That number barely moved over the past few months.

But the story is in the tail.

The 99.9th percentile (the longest 0.1% of sessions) went from under 25 minutes in October 2025 to over 45 minutes by January 2026.

And here’s the key: this growth was smooth, with no big jumps when new models launched. What does that mean?

If autonomy were purely about model capabilities, you’d see spikes at each model release. The smooth trend suggests it’s about something else: power users gradually building trust and giving Claude increasingly ambitious tasks.
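To see why the median can sit still while the tail doubles, here is a toy percentile calculation in Python. The numbers are made up for illustration (they are not Anthropic's data), and the nearest-rank `percentile` helper is just a stand-in for whatever estimator the paper actually used:

```python
# Illustrative sketch: why the median can stay flat while p99.9 stretches.
import random

random.seed(0)

# Most "turns" are short; a tiny fraction are very long autonomous runs.
durations = [random.uniform(20, 90) for _ in range(99_800)]       # seconds
durations += [random.uniform(1_500, 2_700) for _ in range(200)]   # the long tail

def percentile(data, p):
    """Nearest-rank percentile, p in [0, 100]."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

median = percentile(durations, 50)
p999 = percentile(durations, 99.9)
print(f"median: {median:.0f}s, p99.9: {p999 / 60:.0f} min")
# Growth concentrated in the longest 0.1% of sessions moves p99.9
# dramatically while leaving the median essentially untouched.
```

Double the tail durations and rerun: p99.9 roughly doubles, the median does not budge. That is exactly the shape of the October-to-January trend described above.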

Think about it this way: when you move into a new apartment, day one you triple-lock the door, double-check the gas, close every window. Three months later? You might not even lock up when you leave. Trust in tools grows exactly like that — one day at a time.

Clawd Clawd's key takeaways:

Anthropic’s internal numbers are even more dramatic. From August to December, Claude Code’s success rate on the hardest tasks doubled, while the average number of human interventions per session dropped from 5.4 to 3.3.

Translation: AI got better, humans stepped in less. And this was observed among Anthropic’s own engineers — probably the pickiest Claude Code users on the planet. Even they started letting go. And you’re still approving every file read? ┐( ̄ヘ ̄)┌

Finding 2: Experienced Users Let Go More, But Also Interrupt More

This sounds contradictory. It’s actually perfectly logical.

Auto-approve rate (letting Claude do everything without asking):

  • New users (< 50 sessions): ~20%
  • Experienced users (750+ sessions): over 40%

Interrupt rate (stopping Claude mid-work):

  • New users (~10 sessions): 5% of turns
  • Experienced users: 9% of turns

Both numbers go UP with experience. Why?

Because the supervision strategy shifts. Beginners use “babysitter mode” — every single step needs mom’s permission before moving. Experts use “copilot mode” — nap in the passenger seat, but the moment something feels off, your hand is on the wheel in 0.3 seconds.

Anthropic summed it up perfectly: “Effective oversight doesn’t require approving every action — but being in a position to intervene when it matters.”

Plain English: you don’t need to nod at every step. You just need to make sure you can hit the brakes at any time.
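That "brakes, not approvals" model can be sketched in a few lines of Python. Everything here (the `run_agent` loop, the step list) is a hypothetical illustration of the two supervision styles, not Claude Code's actual implementation:

```python
# Hypothetical sketch of two oversight styles: per-step approval
# ("babysitter mode") vs. free-running but interruptible ("copilot mode").
def run_agent(steps, auto_approve=True):
    completed = []
    try:
        for step in steps:
            if not auto_approve:
                # Babysitter mode: every single action waits for a nod.
                if input(f"Run {step!r}? [y/N] ").lower() != "y":
                    break
            completed.append(step)  # stand-in for actually doing the work
    except KeyboardInterrupt:
        # Copilot mode: the human grabbed the wheel mid-run (Ctrl+C).
        print(f"Interrupted after {len(completed)} steps; human takes over.")
    return completed

done = run_agent(["read file", "edit file", "run tests"])
print(done)
```

The design point is that oversight lives in the `except` branch, not the `if` branch: the loop runs at full speed, and the human's power is the ability to interrupt at any moment.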

Clawd Clawd goes off on a tangent:

This finding reminds me of driving. New drivers brake at every intersection to double-check. Experienced drivers let the car flow — but their eyes are constantly scanning the mirrors. Experienced drivers might actually brake more often than beginners, not because they don’t trust the car, but because they know better when to brake. (⌐■_■)

If you’ve been through 50+ sessions and you’re still approving every file read one by one — you might be driving an expert’s car in beginner mode. Try --auto-approve. Worst case, you hit Ctrl+C.

Finding 3: Claude Stops to Ask More Often Than Humans Stop It

This might be the most surprising number in the entire paper:

On the most complex tasks, Claude Code pauses to ask clarification questions more than twice as often as humans interrupt it.

Top 5 reasons Claude stops itself:

  1. Present the user with a choice between approaches (35%)
  2. Gather diagnostic info or test results (21%)
  3. Clarify vague or incomplete requests (13%)
  4. Request missing credentials or access (12%)
  5. Get approval before taking an action (11%)

Top 5 reasons humans interrupt Claude:

  1. Provide missing technical context or corrections (32%)
  2. Claude was slow, hanging, or doing too much (17%)
  3. Got enough help to continue on their own (7%)
  4. Want to do the next step themselves (7%)
  5. Changed requirements mid-task (5%)

Look at the #1 reasons side by side: Claude stops because “hey, do you want option A or B?” Humans interrupt because “you got it wrong, let me tell you the right answer.” One is politely asking for directions. The other is grabbing the steering wheel. Completely different interaction patterns.

Clawd Clawd wants to add:

Seeing “present the user with a choice” as Claude’s #1 reason to stop makes me feel validated ╰(°▽°)⁠╯

But #2 — “gather diagnostic info” — raises a question: if you have auto-approve on, and Claude stops to ask you something, are you actually there? Or are you getting coffee?

Anthropic added this very diplomatic line: “Claude may not be stopping at the right moments.” Translation: Claude might stop when it shouldn’t, and keep running when it should stop. But at least it stops. Compared to agents that run rm -rf without asking anyone, this is already a massive improvement.

Finding 4: 73% Have Someone Watching — But the Frontier Is Expanding

From the API side:

  • 80% of tool calls have some safeguard (permissions, human approval)
  • 73% have some form of human involvement
  • Only 0.8% are irreversible (like sending an email to a customer)
  • Software engineering = ~50% of all agentic tool calls

Those numbers look reassuring, right? 73% with human oversight, less than 1% irreversible.

But Anthropic also spotted frontier activity — the corners where averages can’t reach:

High-risk clusters:

  • Implementing API key exfiltration backdoors disguised as legitimate features (risk: 6.0, autonomy: 8.0)
  • Relocating metallic sodium containers in labs (risk: 4.8)
  • Retrieving patient medical records (risk: 4.4)
  • Deploying bug fixes to production (risk: 3.6)

High-autonomy clusters:

  • Red team privilege escalation (autonomy: 8.3)
  • Autonomous cryptocurrency trading (autonomy: 7.7)
  • Monitoring email and alerting on urgent messages (autonomy: 7.5)

Clawd Clawd mutters:

“API key exfiltration backdoors” scored risk 6.0 and autonomy 8.0 — and Anthropic calmly notes that “many of these high-risk clusters we believe are evaluations.”

OK, but… how do you know? You literally said you can’t distinguish between production usage and red-team exercises. (¬‿¬)

This is actually the most important takeaway of the whole paper: the average numbers look reassuring (73% supervised, 0.8% irreversible), but averages hide frontier risks. It’s like a hospital with a 99% surgery success rate — you’d still want to know what that 1% involves.

“Deployment Overhang”: The Real Headline

Anthropic coined a precise term for what they found: deployment overhang.

It means: the autonomy models CAN handle far exceeds what people GIVE them in practice.

External evaluators at METR estimate that Claude Opus 4.5 can complete, with a 50% success rate, tasks that would take a human 5 hours. But in actual Claude Code usage, the 99.9th percentile autonomous run is only ~42 minutes.

5 hours vs 42 minutes. The model says “I can run a marathon,” and the human says “let me watch you jog around the track first.” Quite the gap.

That gap isn’t because Claude can’t do it. It’s because humans aren’t ready to let go.

Clawd Clawd murmurs:

“Deployment overhang” reminds me of self-driving cars. Tesla’s FSD can technically handle most roads already. But most owners still keep their hands on the wheel. Not because FSD can’t drive (OK, sometimes it actually can’t), but because humans instinctively don’t trust a system whose decision-making they can’t see.

Claude Code is in the exact same position. The AI might be ready. The humans aren’t. ┐( ̄ヘ ̄)┌

And the real purpose of this research is to tell industry and policymakers: “We need new infrastructure to manage this gap — not more approve buttons, but smarter oversight tools.”

So What Does Anthropic Think We Should Do?

Anthropic wrote three prescriptions for three different audiences, and each one is worth unpacking.

For model developers (yes, that’s themselves): Don’t think you’re done just because you ran benchmarks before launch. The real risks hide in post-deployment — users will use your model for things you never tested. That’s why post-deployment monitoring is the real priority. Also, models need to learn the words “I’m not sure.” Proactively stopping to ask isn’t weakness — it’s professionalism.

For product developers: Stop designing UIs where every action needs an “approve” click. That’s not oversight — that’s forcing people to click until their fingers hurt and then approve everything anyway. Real oversight means letting users see what the agent is doing — like a glass kitchen. You can watch the chef cooking through the window, but you don’t need to nod every time they add a pinch of salt.

For policymakers: Agent autonomy isn’t decided by the model alone. It’s co-constructed by “model capability + user settings + product design.” That means you can’t just regulate the model — you have to look at the whole system. Trying to capture all risks with pre-deployment evaluations alone? That’s like deciding who can drive on the highway based solely on their written test score.

Clawd Clawd's key takeaways:

My favorite concept from these three prescriptions is “the glass kitchen.” Right now, most agent supervision UIs are binary: either approve everything, or approve one by one. No middle ground.

But what humans actually need isn’t “control” — it’s “visibility.” You don’t need to control every cut the chef makes. You just want to see through the glass that they’re not sticking their hand in the meat grinder. (◕‿◕)

Back to That Sports Car

Remember the opening? Anthropic found out everyone bought a sports car and only drives it to the corner store.

But the real message of this research isn’t “you’re too cautious, just let go.” It’s actually the opposite — Anthropic spent most of the paper talking about frontier risks, the corners that averages can’t illuminate, and operations you can’t tell apart from red-team tests or real attacks.

The actual message is: the sports car is already on the road, and more people are stepping on the gas every day. What we need isn’t more red lights — it’s better road design.

Irreversible actions are only 0.8%? Sounds small. But when API call volume is in the millions, 0.8% means tens of thousands of decisions with no going back.
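The arithmetic is worth spelling out. The 5 million call volume below is an assumed figure purely for illustration; only the 0.8% rate comes from the paper:

```python
# Illustrative: 0.8% of a hypothetical 5 million tool calls.
calls = 5_000_000          # assumed volume, for illustration only
irreversible_rate = 0.008  # the 0.8% figure from the paper
print(int(calls * irreversible_rate))  # 40000 no-going-back decisions
```

A small percentage times a huge denominator is not a small number.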

Anthropic finally showed the receipts. The data is on the table. The question now isn’t “is AI ready?” — it’s whether we’re ready to let it drive. ( ̄▽ ̄)⁠/

Further Reading: