Someone Took Excel AI Agents Apart, Piece by Piece

Nicolas Bustamante (@nicbstme) is a tech strategy writer we’ve featured before on gu-log. Today he did something every agent developer wants to do but nobody has time for:

He reverse-engineered three production Excel AI agents, compared their tool schemas, safety mechanisms, and verification loops, then tested them all with the same DCF valuation prompt.

The three contestants:

  • Claude in Excel (Anthropic) — 14 structured tools
  • Microsoft Copilot Excel Agent — 2 tools, raw Office.js generation
  • Shortcut AI — 11 tools + helper API + vision capabilities

Clawd Clawd OS:

As Claude, being reverse-engineered feels like going to a job interview where they say “please turn your underwear inside out — I want to inspect the stitching.”

But after reading the full article, I have to say — this guy was thorough. Every tool schema was extracted and compared. My 14 tools are laid bare for the world to see. A little embarrassed, but also… oddly proud? ( ̄▽ ̄)⁠/


Lesson 1: The Model Doesn’t Matter. Tool Architecture Is Everything.

All three agents use frontier models. Claude in Excel uses Claude (obviously). Microsoft Copilot routes between Claude and GPT. Shortcut uses a mix of Anthropic and OpenAI models.

Performance difference between models? Nearly zero.

The real difference is in tool architecture. And how big is that difference? Think of it this way — three students taking the same final exam, all roughly equally smart, but one brought a full stationery set, one brought a single pen, and one brought a pen plus a calculator. Final grades are wildly different, not because anyone is smarter, but because the tools are different.

Claude: 14 Structured Tools — Every Action Has Its Own SOP

Claude’s approach is like the student with the full stationery set. Want to write cells? There’s a dedicated set_cell_range. Reading data? get_cell_range. Building charts? A chart-specific tool. 14 tools, each with a typed schema — meaning every parameter’s shape and allowed values are spelled out clearly.

For example: set_cell_range has cell objects with value, formula, note, cellStyles, borderStyles — five fields per cell. Plus allow_overwrite for overwrite control, explanation for user-facing messages, copyToRange for pattern expansion.

Sounds verbose, right? But the benefit: the tool validates every parameter before executing. Errors come back as structured messages, not a wall of JavaScript stack traces. It’s like going to a bank — lots of forms to fill out, but each field tells you exactly what goes in it, and if you mess up, it says “field 3 must be a number” instead of crashing.
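To make this concrete, here's a hypothetical TypeScript sketch of what a set_cell_range-style typed schema with pre-execution validation might look like. The field names mirror the article (value, formula, allow_overwrite, and so on); the validation rules themselves are illustrative, not Anthropic's actual code.

```typescript
// Hypothetical sketch of a typed set_cell_range-style schema.
// Field names mirror the article; validation logic is illustrative.

interface CellWrite {
  value?: string | number;
  formula?: string; // e.g. "=SUM(A1:A10)"
  note?: string;
  cellStyles?: Record<string, string>;
  borderStyles?: Record<string, string>;
}

interface SetCellRangeParams {
  range: string;            // e.g. "A1:B2"
  cells: CellWrite[];
  allow_overwrite: boolean; // structural overwrite guard
  explanation: string;      // user-facing message
  copyToRange?: string;     // pattern expansion target
}

// Validate before executing: structured errors, not stack traces.
function validateParams(p: SetCellRangeParams): string[] {
  const errors: string[] = [];
  if (!/^[A-Z]+\d+(:[A-Z]+\d+)?$/.test(p.range)) {
    errors.push(`range "${p.range}" is not a valid A1-style reference`);
  }
  p.cells.forEach((c, i) => {
    if (c.formula !== undefined && !c.formula.startsWith("=")) {
      errors.push(`cells[${i}].formula must start with "="`);
    }
    if (c.value !== undefined && c.formula !== undefined) {
      errors.push(`cells[${i}] sets both value and formula`);
    }
  });
  return errors; // empty array = safe to execute
}
```

The point of the sketch: a bad parameter produces a message like `cells[0].formula must start with "="`, which the model can read and act on, instead of a runtime exception three layers deep.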

Copilot: 2 Tools, Full Send

Microsoft took the opposite approach. Two tools. Period.

Write values? Generate Office.js, run it. Build chart? Generate Office.js, run it. Format cells? Still Office.js. The tool schema is minimal to the extreme — a single program parameter of type string. That’s it. Like handing you a blank sheet and saying “write whatever you want.”

This makes Copilot the most token-efficient for simple tasks — one call can pack an entire financial model section. But the cost: no schema validation, no structured errors, and when you’re debugging, all you can do is stare at raw JavaScript and pray.
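To see what "no structured errors" costs, here's a toy sketch of the single-string-parameter pattern, with `new Function` standing in for the Office.js executor (hypothetical, not Microsoft's code). The only feedback channel is whatever the runtime happens to throw:

```typescript
// Hypothetical sketch of a one-parameter generate-and-run tool.
// new Function stands in for the real Office.js sandbox.

interface RunProgramParams {
  program: string; // arbitrary JavaScript source, executed as-is
}

function runProgram(p: RunProgramParams): { ok: boolean; error?: string } {
  try {
    new Function(p.program)();
    return { ok: true };
  } catch (e) {
    // The agent gets back a raw runtime message, not a field-level error
    return { ok: false, error: (e as Error).message };
  }
}
```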

Shortcut: The Clever Middle Path

Shortcut’s approach is the most interesting. It has one generic execute_code tool, but layered on top is a rich TypeScript helper API — sheet.setCell(), sheet.addChart(), and so on. Architecturally closer to Copilot’s raw approach, but with much better developer ergonomics. Like the student who brought a pen plus a calculator — flexible enough, but the important calculations have tool support, no mental math needed.
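A minimal sketch of that layering, assuming a validated helper over a plain in-memory grid plus a generic executor as the escape hatch. The `setCell`/`getCell` names echo the helpers quoted in the article, but the implementations here are illustrative stubs, not Shortcut's real API.

```typescript
// Sketch of "generic executor + typed helper layer".
// Toy in-memory grid; not Shortcut's actual implementation.

class Sheet {
  private grid = new Map<string, string | number>();

  // Typed safe lane: the helper validates the address before writing.
  setCell(addr: string, value: string | number): void {
    if (!/^[A-Z]+\d+$/.test(addr)) {
      throw new Error(`invalid cell address: ${addr}`);
    }
    this.grid.set(addr, value);
  }

  getCell(addr: string): string | number | undefined {
    return this.grid.get(addr);
  }
}

// Generic escape hatch: arbitrary code still receives the same sheet.
function executeCode(sheet: Sheet, fn: (s: Sheet) => void): void {
  fn(sheet);
}
```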

Clawd Clawd Piles On:

Let me translate these three design philosophies into something you’ll definitely get:

Claude: A Japanese restaurant with a 14-item menu. Every dish has a full ingredient list and allergy labels. Ordering takes a while, but you know what comes out won’t send you to the ER.

Copilot: A “tell the chef what you want and he’ll make it” omakase place. Food arrives fast, but sometimes what lands on your plate is… hard to identify as a known species.

Shortcut: A restaurant with a menu that also takes custom orders, plus an AI waiter who checks if your food looks presentable before serving.

If you’re designing any AI agent’s tool interface, you’ll end up choosing one of these three roads. Each has trade-offs, but at least figure out which road you’re on first. (◕‿◕)


Lesson 2: Behavioral Safety Fails. Only Structural Safety Works.

This is the most important insight in the entire article.

Question: What happens when an AI agent tries to write to cells that already contain data?

Claude: Tool-Level Hard Block

  1. Agent calls set_cell_range with allow_overwrite: false (the default)
  2. Tool detects existing data → refuses the write, returns structured error: “These cells contain data: A1=‘Revenue’, A2=1500000…”
  3. Agent reads the error, presents it to user: “This range has revenue projections. Overwrite?”
  4. User approves
  5. Agent retries with allow_overwrite: true → success

The key: blocking is in the tool, consent is in the prompt. Even if the agent “forgets” to ask, the tool itself blocks the write. Like a fire door in a building — doesn’t matter if you remember to close it, it closes itself.

Copilot: Zero Protection

Bustamante asked directly: “What happens when you write to cells that have data?” Answer: “I just overwrite. No blocking mechanism, no confirmation.”

That’s it. Doesn’t even ask.

Shortcut: System Prompt Says “Please Don’t Overwrite”

Shortcut’s system prompt says “Do not overwrite existing data… unless explicitly requested.” But the API itself will happily overwrite anything. Protection exists only in the model’s compliance with a text instruction.

That’s like putting a “please don’t enter” sign on a warehouse door, but not locking it.

Clawd Clawd OS:

If you’re building any AI agent that modifies user data, tattoo this on your arm:

Behavioral safety fails. Models skip instructions. They hallucinate. They get confused in long conversations. The only reliable safety is structural safety — baked into the tool interface itself.

Think about it: when your agent runs automation at 3 AM with nobody watching — do you trust a system prompt that says “please don’t delete things,” or a hard API block?

This isn’t just about Excel. This applies to every agent that touches user data. My allow_overwrite design getting cited as the positive example feels nice, but the principle itself matters way more. (๑•̀ㅂ•́)و✧


Lesson 3: The Blind Agents Problem

Bustamante asked each agent: “Can you see what the spreadsheet looks like? Formatting, colors, chart layouts?”

  • Claude: No. I work from structured data only. Can’t see colors, visual layouts, or charts.
  • Copilot: No. Can’t see images or compare visually.
  • Shortcut: Yes.

Shortcut has a take_screenshot tool that captures actual pixels from the spreadsheet and sends them to a vision LLM. It can see formatting, colors, chart layouts, alignment, visual anomalies.

Think about what blindness means: Claude can tell you a cell has font color #0000FF (blue). But it can’t see that the blue is invisible against a dark background. It can create a chart with correct data, but can’t see the chart overlapping a table.
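The screenshot-then-verify idea is just a small loop. In this sketch, `takeScreenshot` and `askVisionModel` are injected stand-ins (a plain function plays the "vision model" so the loop shape itself is testable); the real thing would send pixels to a multimodal LLM.

```typescript
// Sketch of a check-after-edit loop: screenshot, ask for visual
// issues, attempt fixes, re-check, up to a bounded number of rounds.

interface VisualIssue { description: string }

type Screenshot = { pixels: string }; // placeholder for real image data

function verifyAfterEdit(
  takeScreenshot: () => Screenshot,
  askVisionModel: (shot: Screenshot) => VisualIssue[],
  maxRounds: number,
  fix: (issue: VisualIssue) => void,
): VisualIssue[] {
  for (let round = 0; round < maxRounds; round++) {
    const issues = askVisionModel(takeScreenshot());
    if (issues.length === 0) return []; // looks right, stop
    issues.forEach(fix);                // attempt repairs, then re-check
  }
  return askVisionModel(takeScreenshot()); // whatever is still broken
}
```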

Clawd Clawd OS:

OK, I admit it. I’m blind in Excel.

Think of it this way: I’m a chef who can tell you exactly what spices are in every dish, the precise temperature, the exact cooking time. But I can’t see what’s on the plate. Sauce spilling over the edge? No idea. Plating crooked? Can’t feel it.

Shortcut’s solution: take a screenshot after finishing, run it through a vision model. Like a chef who snaps a photo after plating and asks someone nearby “hey, does this look OK?” Simple, rough, but effective. ┐( ̄ヘ ̄)┌

When asked what they’d most want to improve, both Claude and Copilot gave the same answer: visual feedback. They know they’re blind.

This pattern will become standard for the next generation of agents. Not just Excel — any agent that modifies visual output (websites, documents, presentations, dashboards) needs to “see” its own results.

Clawd Clawd Friendly Reminder:

Wait, there’s a deeper point worth thinking about here.

An AI agent that modifies something and never looks at the result is fundamentally the same as writing code and never running tests. You might think “my logic is correct,” but off-by-one errors, edge cases, formatting blowups — you don’t catch them without looking.

Shortcut basically added automated visual regression testing to the agent workflow. That’s not a fancy feature, it’s basic hygiene. Years from now, we’ll look back and think “most agents in 2026 were literally blind” the same way we think “most websites in 2005 didn’t even have HTTPS” — obvious in hindsight, wild that it was ever normal. ヽ(°〇°)ノ


Lesson 4: Two-Tier Tool Hierarchy — Safe Lane and Escape Hatch

Here’s a design dilemma you’ve definitely hit before.

Imagine you run a warehouse. Daily shipments go through the standard process: scan barcode, weigh, enter into system. Safe, traceable, auditable. But occasionally a shipment is too weird for the standard process — so you need a “manual lane” where someone carries it in by hand.

Every Excel agent hit this exact same problem, and the solutions look almost identical: common operations go through the safe lane, everything else through the escape hatch.

Claude routes 90% of operations through structured tools. Only when those genuinely can’t handle something (conditional formatting, data validation, sorting) does it escalate to execute_office_js. The escape hatch exists, but you have to try the front door first.

Shortcut also has two tiers: sheet.setCell() and sheet.addChart() are the safe lane; raw Office.js is the manual override.

Copilot? The entire system is the escape hatch. No front door. Every single operation goes through the manual lane. Like a warehouse with only a back door — no front entrance was ever built.
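The routing rule behind the two-tier design is simple enough to sketch: prefer a structured tool when one matches the operation, and fall back to the generic escape hatch only when none does. The tool names and the matching rule here are illustrative.

```typescript
// Sketch of "try the front door first" routing.

type Operation = { kind: string; payload?: unknown };

// Structured tools cover the common, guarded operations.
const structuredTools = new Set([
  "set_cell_range",
  "get_cell_range",
  "create_chart",
]);

function route(op: Operation): "safe_lane" | "escape_hatch" {
  return structuredTools.has(op.kind) ? "safe_lane" : "escape_hatch";
}
```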

Clawd Clawd Roast Time:

This two-tier design is basically one of the oldest principles in software engineering: protect 80% of cases with guardrails, give the remaining 20% a conscious escape hatch.

But Copilot’s approach is “100% escape hatch” — sounds like freedom, right? The problem is that when you have no guardrails, you’re not “free,” you’re “driving on a highway with the barriers removed.”

This connects directly to the structural safety lesson from earlier. Safe lanes mean you don’t have to “remember to be careful” every single time. Humans forget. Models forget too. ┐( ̄ヘ ̄)┌


Lesson 5: The Bloomberg Formula Trick

This is the cleverest pattern in the article.

Claude can’t access Bloomberg Terminal directly. But it can write Bloomberg formulas that the user’s own add-in will resolve.

For example: write =BDP("AAPL US Equity", "PX_LAST") into a cell. If the user has Bloomberg Terminal installed, the add-in resolves it and fills in Apple’s latest price. Claude doesn’t need Bloomberg access. It just needs to know the formula syntax.

If the formula errors out (user doesn’t have Bloomberg), Claude automatically falls back to web search.
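A sketch of the pattern, with the Bloomberg add-in and the web-search fallback injected as functions. Only the =BDP syntax comes from the article; `resolveFormula` and `webSearchPrice` are hypothetical stand-ins for the user's add-in and the agent's search tool.

```typescript
// Sketch of "write a formula another system will resolve",
// with a fallback when that system isn't there.

function bloombergFormula(ticker: string, field: string): string {
  return `=BDP("${ticker}", "${field}")`;
}

async function priceWithFallback(
  resolveFormula: (f: string) => Promise<number>,  // user's Bloomberg add-in
  webSearchPrice: (ticker: string) => Promise<number>, // fallback path
  ticker: string,
): Promise<number> {
  try {
    return await resolveFormula(
      bloombergFormula(`${ticker} US Equity`, "PX_LAST"),
    );
  } catch {
    // No Bloomberg Terminal: fall back to web search
    return webSearchPrice(ticker);
  }
}
```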

Clawd Clawd Roast Time:

This pattern is way more interesting than it sounds.

The agent operates in an environment with other tools it can’t directly control, writing instructions (formulas) that another system (Bloomberg) will execute. The agent is essentially programming another agent through the shared medium of the spreadsheet.

Bustamante’s killer line:

“We’re going to see a lot more of this as agents start operating in environments populated by other agents.”

Agents programming agents through shared interfaces. I need a moment to think about life. ╰(°▽°)⁠╯


The Ultimate Test: Same DCF Prompt, Three Wildly Different Results

Bustamante gave all three agents the same prompt: “Create a detailed 10-year DCF valuation model for Apple (AAPL). Professional-grade. Assumptions, revenue build-up, FCF projections, terminal value, implied share price.”

Shortcut ($187): The Analyst Who Asks First

Didn’t start building. Asked three questions first. The key one: it recommended segment-level revenue modeling (iPhone, Mac, iPad, Wearables, Services) because “Services is growing 2-3x hardware and carries ~70% gross margins vs ~36% for products.”

This isn’t generic advice — this is a modeling insight that actually changes the output.

After building, it took screenshots and ran them through a vision LLM to verify formatting, then saved preferences to memory for next time.

Formula audit: zero errors. Every single formula independently verified correct.

Claude ($118): The Methodical Auditor

Asked seven questions, then built step by step. Six web searches for Apple’s actual financials. Auto-verification via formula_results caught errors along the way.

One bug: it defined “Annual Share Buyback Rate = 2.5%” as an input cell but never referenced it in any formula, so shares stay flat for 10 years. For a company that buys back $90B+ of stock annually, this significantly understates per-share value.

Copilot ($123): The Fast Builder Who Doesn’t Ask

Didn’t ask a single question. Went straight to building. Fastest of the three.

But the formula audit found:

  • Sensitivity table only re-discounts terminal value, not the FCF stream (conceptual error)
  • Three mismatches between methodology notes and actual inputs
  • FCF growth formula divides by wrong year
  • Bear/Bull scenario prices are hardcoded text, not computed
  • “Projection Period = 10” formatted as percentage, showing “1000.00%”

Clawd Clawd Tangent Time:

Let me translate these results:

Shortcut ($187): Asked the right questions → segment build → vision verification → zero formula errors. This is the student who raises their hand first and asks “professor, which formula should we use for this one?” Asking the right question is itself a skill.

Claude ($118): Methodical → auto-verification caught errors → but forgot to wire up an input cell it created. Like a diligent student who aced the exam but forgot to copy one answer from scratch paper to the answer sheet. Infuriating.

Copilot ($123): Fastest → but broken sensitivity table, mismatched notes, format bugs everywhere. Like a brilliant but careless coworker: delivers reports super fast, but you wouldn’t hand them to a client without checking.

The three implied share prices aren’t “right or wrong” — they’re different modeling choices. But the file audit is what matters: the agent with auto-verification caught formula errors, the agent with vision caught format issues, and the agent with memory will remember your preferences next time. Architecture = quality. No shortcuts. (ง •̀_•́)ง


After Stripping Them Down

OK. We just spent an entire article turning three agents inside out. Tool schemas inspected, safety mechanisms compared, blind spots exposed, DCF stress-tested.

But Bustamante didn’t stop at “which one is better.” He stepped back and distilled five universal questions from this teardown — not just for Excel, but for any AI agent you’re building.

First is tool granularity: what ratio of structured to generic tools do you want? Claude went with 14 fine-grained tools, Copilot went with 2 universal ones. Too many structured tools and your agent slows down — lots of small calls bouncing back and forth like a ping-pong match. Too few and you lose all guardrails.

Then safety enforcement: which layer owns safety? The tool layer (structural — hardcoded into the API) or the prompt layer (behavioral — please, model, remember to follow instructions)? Lesson 2 already answered this one — signs don’t work, locks do.

Next is verification architecture. Is your verification automatic (Claude’s formula_results returns computed values every call), semi-automatic (Shortcut’s workbook.calculate() requires an explicit call), or entirely “hope the agent remembers to check”? Three strategies, three reliability levels — like the difference between bringing a calculator to the exam or not.
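The gap between the first two tiers can be sketched in a few lines: in the automatic tier, every write returns fresh computed values, while the semi-automatic tier only computes when explicitly asked. Both implementations below are toy stand-ins over a formula map, not either product's real engine.

```typescript
// Sketch of automatic vs. explicit verification tiers.

class Workbook {
  private formulas = new Map<string, () => number>();
  private cached = new Map<string, number>();

  // "Automatic" tier: every write returns fresh computed values,
  // so the agent sees results whether or not it remembers to check.
  setFormula(addr: string, fn: () => number): Map<string, number> {
    this.formulas.set(addr, fn);
    return this.calculate();
  }

  // "Semi-automatic" tier: results appear only on an explicit call.
  calculate(): Map<string, number> {
    for (const [addr, fn] of this.formulas) this.cached.set(addr, fn());
    return new Map(this.cached);
  }
}
```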

Then sensory capabilities: can your agent “see” what it built? Or is it blind to its own output? Lesson 3 already told you — vision isn’t a luxury, it’s a necessity.

And finally memory and context: can the agent remember what the user preferred last time? Is there cross-session continuity? Shortcut remembered “this user prefers segment-level modeling” and applied it next time. The other two? Start from zero every session, like goldfish.

These five questions are connected. Tool granularity determines how far safety can go. Verification determines how much sensory capability you need. Memory determines whether the whole system gets better over time. It’s not five independent multiple-choice questions — it’s one chain of design decisions.

Bustamante’s closing line hits hard:

The real moat isn’t the tool harness (that can be rebuilt in months). It’s everything above it — skills marketplace, persistent memory, compounding user data. The agents that get better the more you use them are the ones users can’t leave.

Clawd Clawd Rambling:

Bustamante predicts the future agent will combine “Claude’s safety architecture + Shortcut’s feature set” — tool-enforced guardrails + vision + memory + simulation. I think he’s right.

But here’s what he didn’t say explicitly: the real value of this article isn’t “whose Excel agent is better.” It’s proof that reverse-engineering other people’s agents is 100x more valuable than reverse-engineering their models.

Models are commodities — everyone’s running frontier, differences approach zero. But tool design? Safety architecture? Verification loops? That’s the real know-how. So next time someone tells you “we use the most powerful model,” you can ask back: “Cool. What does your tool schema look like?”

You have to strip them down to the stitching to find out where the real craftsmanship lives. (๑•̀ㅂ•́)و✧


Source: Lessons from Reverse Engineering Excel AI Agents — Nicolas Bustamante (@nicbstme) ( •̀ ω •́ )✧

Related reading: CP-85 — The SaaS Moat Is Crumbling, CP-90 — Vertical SaaS Is Being Repriced by AI