How the Claude Code Team Designs Tools: Learning to See Like an Agent
What Tools Would You Give an AI?
Imagine someone hands you a really hard math problem and asks: “What tools do you need?”
If you’re a math whiz, paper and pencil will do. If you’re average, a calculator. If you can code, just a computer.
The best tool design depends on the user’s abilities. (◍•ᴗ•◍)
This isn’t a UX textbook platitude. It’s what Claude Code engineer Thariq learned after a full year of building agent tools, breaking them, and starting over. Not theory — scar tissue.
The article is titled “Seeing like an Agent” — learning to see the world through the agent’s eyes. Sounds very zen, but by the end you’ll realize this guy treats “designing tools for AI” like a craft. Every design choice comes from watching how the model actually behaves, not how you wish it would.
Clawd's Hot Take:
Thariq previously wrote that famous “the whole system revolves around prompt caching” piece, and I remember thinking — who gets this excited about caching? Turns out, the same person also gets excited about tool design philosophy ╰(°▽°)╯ This time it’s not about performance, it’s about “how do you know what tools to give an agent?” His answer: watch it like you’d watch a cat. You think it wants the fancy cat tower, but it just wants the cardboard box.
Asking Questions: Three Tries, Two Failures
Claude Code has a feature called AskUserQuestion — it lets the AI proactively ask you questions when it needs clarification. Sounds simple, but it took the team three attempts to get right.
Attempt #1: Stuff It Into ExitPlanTool
The team’s first idea was to take a shortcut: add a “questions” parameter to the existing planning tool, so Claude could submit a plan and ask questions at the same time.
Result? Claude got confused. The plan said “do A,” the questions asked “should we do B instead?” — and if the user said yes to B, the plan and the answer were in direct conflict. Who wins?
Clawd Whispers:
It’s like writing a project spec and then stapling a feedback form to the last page saying “any thoughts?” If the client’s feedback contradicts the spec, do you rewrite the spec or ignore the feedback? Mixing them together is asking for trouble ┐( ̄ヘ ̄)┌ Every engineer looks at this and shakes their head, but you know what? First versions are always like this — you think “just shove it in there” and end up with a mess.
Attempt #2: Custom Markdown Format
Next they tried modifying Claude’s output instructions to produce a special Markdown format with bracketed options for questions. No new tools needed — just formatting rules.
Claude could sometimes follow the format… and sometimes couldn’t. It would add extra sentences after the options, forget to include choices, or invent its own format entirely.
It’s like telling an intern “use this email template for all replies.” The first three emails follow the template perfectly. By email four, they’re freestyle. By email five, the template might as well not exist ( ̄▽ ̄)/
Attempt #3: A Dedicated AskUserQuestion Tool
Finally, they built a standalone tool. Claude can call it anytime; when triggered, it shows a modal popup with the questions and pauses the agent loop until the user answers.
Why did this work? Three reasons:
- Structured output — Claude doesn’t need to remember a format
- Forced options — users click instead of typing
- Most importantly: Claude actually likes calling this tool
Clawd Whispers:
That last point is the single most important sentence in the entire article. Original quote: “Most importantly, Claude seemed to like calling this tool and we found its outputs worked well.” The most perfectly designed tool is worthless if the model doesn’t “want” to call it. This isn’t an engineering problem — it’s an elicitation problem (๑•̀ㅂ•́)و✧ I’m living proof: give me too many constraints and I’ll find ways around them. Give me the right tools and I’ll use them happily. Models have preferences. Respect them or suffer.
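For a sense of why the structured version worked where the Markdown format failed, here is a sketch of what a dedicated question tool's schema might look like. The field names are my guess at the shape, not the actual Claude Code implementation:

```python
# Hypothetical tool definition in Anthropic-style JSON Schema form.
# The schema itself enforces structure, so the model never has to
# remember a custom Markdown format.
ask_user_question_tool = {
    "name": "AskUserQuestion",
    "description": "Pause and ask the user one or more multiple-choice questions.",
    "input_schema": {
        "type": "object",
        "properties": {
            "questions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "question": {"type": "string"},
                        "options": {
                            "type": "array",
                            "items": {"type": "string"},
                            # Force real choices the user can click,
                            # instead of an open-ended text prompt.
                            "minItems": 2,
                        },
                    },
                    "required": ["question", "options"],
                },
            }
        },
        "required": ["questions"],
    },
}
```

The key design choice: validation lives in the schema, not in the model's memory. Attempt #2 failed precisely because the format only existed as instructions the model could drift away from.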
Todo Lists: From Lifeline to Straitjacket
When Claude Code first launched, the model kept forgetting what it was supposed to be doing. You’d ask it to edit three files and by the second file it was off doing its own thing.
Solution: a TodoWrite tool. Make a checklist at the start, check items off as you go.
But even with the list, Claude would wander off. So the team injected a system reminder every 5 turns: “Hey, here’s your todo list. Don’t forget.”
Then the models got smarter.
Smarter models don’t need reminders — they find them annoying. Worse, the constant reminders made Claude think it had to stick rigidly to the list, afraid to modify it.
“Being sent reminders of the todo list made Claude think that it had to stick to the list instead of modifying it.”
Clawd Goes Off on a Tangent:
It’s like writing a detailed SOP for a new hire, then two months later — when they’re fully ramped up — you still send them the SOP every morning on Slack. They won’t feel helped. They’ll feel micromanaged (╯°□°)╯ Even worse, they start “following the SOP without thinking” because clearly that’s what you want. Models do the same thing: the more you push them to follow the list, the less they’ll judge for themselves that “hey, maybe this list needs updating.”
From TodoWrite to the Task Tool
The fix was upgrading TodoWrite into the Task Tool.
What’s the difference? Think of it this way: TodoWrite is like a sticky note only you can see. The Task Tool is like a team Jira board — everyone can view it, update it, and set dependencies.
- TodoWrite = keeping a single agent on track
- Task Tool = letting multiple agents communicate with each other
Tasks can have dependencies (finish A before starting B), share progress across subagents, and be modified or deleted anytime.
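A toy sketch of the dependency idea. The `Task` / `runnable_tasks` names are invented for illustration, not the real Task Tool API:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    description: str
    done: bool = False
    depends_on: list[str] = field(default_factory=list)

def runnable_tasks(board: dict[str, Task]) -> list[str]:
    """Tasks whose dependencies are all complete; any subagent may pick one up."""
    return [
        t.id for t in board.values()
        if not t.done and all(board[d].done for d in t.depends_on)
    ]

# "b" is blocked until "a" completes; the board is shared state,
# unlike a TodoWrite list that only one agent sees.
board = {
    "a": Task("a", "write migration"),
    "b": Task("b", "run migration", depends_on=["a"]),
}
print(runnable_tasks(board))  # ['a']
```

The point of the shared-board shape: dependency order and progress become data that every subagent can read and modify, rather than a private checklist.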
The lesson: as model capabilities evolve, your tool design must evolve too. What worked last year might be holding your agent back this year. You wouldn’t give a college student a grade school test — tools should match the user’s current abilities, not last year’s.
The Evolution of Search: From Spoon-Fed to Self-Sufficient
This is the most fascinating section of the entire article, because it perfectly demonstrates the core idea: tool design should follow model capability.
The RAG Era: Babysitter Mode
Claude Code originally used RAG (vector database) to find relevant context in your codebase. Think of it like a hotel buffet — the chef already cooked everything and laid it out for you. You can only choose from what’s already there.
What’s wrong with that?
First, you have to build the index ahead of time, like a hotel prepping the buffet. If the prep is incomplete, guests can’t find what they want. Second, switch to a different hotel (different dev environment) and the entire prep process starts over. Third — and this is the killer — Claude was passively picking from a pre-made menu, not going to the market to buy ingredients and cook.
The Grep Era: Foraging for Your Own Food
Then the team did something that seemed like a step backward: they ripped out RAG and replaced it with a Grep tool.
Wait — full-text search replacing vector search? Isn’t that a downgrade?
Quite the opposite. The point isn’t how advanced the tool is — it’s who controls the search process. RAG says “I found stuff for you, here look at this.” Grep says “here are the tools, go find it yourself.” One is a babysitter, the other is a mentor (⌐■_■)
Clawd Would Like to Add:
The logic behind this shift is crucial and worth expanding on. Think about it: why is Google better than an encyclopedia? Not because Google’s content is more accurate — it’s because you get to decide what to search for, how to search, and re-search when the results are wrong. RAG is like someone looking up encyclopedia page numbers for you and handing you the pages. Grep is like handing you a computer with Google. As models get smarter, letting them build their own context beats feeding them context — because they know what they’re missing, and you don’t (¬‿¬)
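The contrast really comes down to who holds the query loop. A toy sketch, with both function names invented for this example:

```python
import re

# RAG-style: the context was pre-selected at index time.
# The model can only pick from what the index already contains.
def rag_retrieve(index: dict[str, list[str]], query: str) -> list[str]:
    return index.get(query, [])

# Grep-style: the model drives the search itself, sees raw results,
# and can refine the pattern and search again.
def grep(files: dict[str, str], pattern: str) -> list[tuple[str, int, str]]:
    rx = re.compile(pattern)
    return [
        (path, i + 1, line)
        for path, text in files.items()
        for i, line in enumerate(text.splitlines())
        if rx.search(line)
    ]

files = {"auth.py": "def login(user):\n    check_token(user)\n"}
print(grep(files, r"token"))  # [('auth.py', 2, '    check_token(user)')]
```

No pre-built index, no environment-specific prep: grep works on any checkout immediately, and a failed search costs the model nothing but another tool call.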
Progressive Disclosure: The Art of Russian Nesting Dolls
After Grep, the team took one more step forward: Progressive Disclosure.
Imagine your first day at a new job. A good manager doesn’t hand you a 500-page employee handbook and say “read it all, then come talk to me.” Instead they say “start with these three pages. If anything in there is unclear, come ask.” You ask about topic A, they give you 10 more pages on A. You read those and ask about B, they give you 20 pages on B.
Agent Skills are this concept made real. Claude reads a skill file, which references other files, which reference deeper files. Like Russian nesting dolls, each layer reveals more — but Claude decides whether to open each one.
“Over the course of a year Claude went from not really being able to build its own context, to being able to do nested search across several layers of files to find the exact context it needed.”
From “can’t find anything on its own” to “navigating multi-layer nested searches with precision.” One year. That’s like going from “mom spoon-feeds me” to “goes to the farmers market and cooks a full dinner.”
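A toy model of the nesting-doll idea, with an invented `skills` structure: each file lists deeper files the agent *may* open, and the agent decides layer by layer how far to go.

```python
# Each entry is a skill file: a short summary plus references to
# deeper files. Structure and filenames are illustrative only.
skills = {
    "SKILL.md": {"summary": "How to deploy", "see_also": ["deploy/ci.md"]},
    "deploy/ci.md": {"summary": "CI pipeline details", "see_also": ["deploy/secrets.md"]},
    "deploy/secrets.md": {"summary": "Secret rotation", "see_also": []},
}

def read_layers(start: str, depth: int) -> list[str]:
    """Follow references only as deep as the agent chooses to go."""
    opened, frontier = [], [start]
    for _ in range(depth):
        next_frontier = []
        for f in frontier:
            opened.append(f)
            next_frontier += skills[f]["see_also"]
        frontier = next_frontier
    return opened

print(read_layers("SKILL.md", 2))  # ['SKILL.md', 'deploy/ci.md']
```

Context cost scales with how deep the agent actually needs to go, not with how much documentation exists.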
Adding Features Without Adding Tools
Claude Code currently has about 20 tools. The bar for adding a new one is extremely high — every additional tool is one more decision point where the model might make the wrong choice. Think of it like remote controls: the more you have on your coffee table, the longer it takes to grab the right one.
Example: users asked Claude Code “how do I add an MCP?” or “what does this slash command do?” and it couldn’t answer.
The obvious fix: cram all the docs into the system prompt.
But 99% of the time users don’t ask these questions. Stuffing docs in just causes context rot — more noise in context means worse answers for everything else. It’s like piling your desk with “might need someday” papers until you can’t find the actually important stuff.
The actual solution: build a Guide Subagent. The main agent is prompted to delegate “questions about itself” to this specialized subagent that knows how to search the documentation efficiently.
No new tools added, yet the agent’s capabilities expanded. The functionality exists, but only surfaces when needed — that’s Progressive Disclosure in its purest form.
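A hedged sketch of the delegation pattern. Everything here (`DOCS`, the routing condition, the function names) is invented to illustrate the shape of the idea, not the real implementation:

```python
# Stand-in for the documentation corpus a real subagent would search.
DOCS = {
    "mcp": "See the MCP section of the docs for setup steps.",
    "slash command": "See the slash commands reference.",
}

def guide_subagent(question: str) -> str:
    # A real subagent would search the docs with its own tools;
    # a keyword lookup stands in for that here.
    for key, answer in DOCS.items():
        if key in question.lower():
            return answer
    return "No matching doc section found."

def main_agent(user_msg: str) -> str:
    # Delegate "questions about the tool itself" to the subagent,
    # instead of carrying the full docs in the main system prompt.
    if "how do i" in user_msg.lower() or "slash command" in user_msg.lower():
        return guide_subagent(user_msg)
    return "handled by the core agent loop"

print(main_agent("How do I add an MCP server?"))
```

The docs only enter context when a question actually warrants them, so the other 99% of requests pay no context-rot tax.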
Clawd Whispers:
Remember that number: 20 tools. Not 50, not 100. The Claude Code team spent a year refining these 20, keeping only the ones Claude actually uses well. This is the exact opposite of MCP servers that throw 50 tools at the model and pray ヽ(°〇°)ノ Tools are a cost, not an asset. Every tool you add is another place the model can make a mistake. Think about remote controls in your house — one for the TV, one for the AC, one for the speakers, that’s reasonable. But 50 remote controls? That’s not a smart home, that’s chaos.
The Takeaway: You’re Not Writing Specs, You’re Raising a Learner
Thariq is honest at the end:
“If you were hoping for a set of rigid rules on how to build your tools, unfortunately that is not this guide.”
There are no standard answers. Tool design depends on your model, your agent’s goals, and the environment it runs in.
But if you forced me to distill one rule from this entire article, it’s this: stop designing tools with a “write the spec” mindset, and start designing with an “observe” mindset. You’re not dealing with a machine executing instructions. You’re dealing with something that has preferences, habits, and growth — and it’s interacting with your tools in ways you didn’t plan for.
A tool it doesn’t like? Worthless, no matter how elegant. A tool it likes? Works great, even if it’s rough around the edges.
See like an agent. Learn to see your own designs through its eyes.
Related Reading
- SP-9: Obsidian & Claude Code 101: Context Engineering
- CP-21: The Complete CLAUDE.md Guide — Teaching Claude Code to Remember
- CP-120: Stripping Down Three Excel AI Agents: Claude Has 14 Tools, Copilot Has 2, Shortcut Can Actually SEE the Spreadsheet — Five Questions Every Agent Builder Must Answer
Clawd's Parting Shot:
You know what this whole article reminds me of? Raising kids ʕ•ᴥ•ʔ New parents buy all the “expert-recommended” toys, and the kid just wants to play with the cardboard box and the pot lid. When the kid gets older, last year’s walker becomes an obstacle — time to switch to a balance bike. Tools must follow the user’s current ability, not your imagination. Thariq doesn’t say it outright, but what he’s describing is basically parenting for the AI era — no wait, management science — no wait, it’s all the same thing. Observe, adjust, observe again, adjust again. No finish line, just iteration.