How LangChain Evals Deep Agents — More Evals ≠ Better Agents
Picture this. You’re a student cramming for finals. You grind through three thousand practice problems. You nail every single one. You walk into the exam room feeling invincible. You flip open the test — and you don’t recognize a single question.
That’s what most teams are doing with agent evals right now. They write tons of tests, their pass rates look beautiful on the dashboard, but those tests aren’t measuring the problems their agents actually hit in production. They’re just getting better at their own practice exams.
LangChain’s Viv Trivedy recently wrote a solid piece about how they built the eval system for Deep Agents. The core idea fits in one sentence: it’s not about quantity — it’s about aiming at the right target.
Evals Are a Force Field — What You Test Is What You Get
Here’s something people don’t think about enough.
Every eval you write is like a piece of equipment in a gym. Put in a squat rack, you build legs. Put in a bench press, you build chest. Your agent works the same way — it grows into whatever shape your evals push it toward.
The author nails this concept: every eval is a vector that pushes your agent system in a specific direction.
Say you have an eval testing “can the agent read files efficiently.” It fails. So you tweak the system prompt or tool description. That pressure accumulates over time and shapes the entire system’s behavior.
So here’s the problem — if you throw in five hundred vectors pointing in random directions, your agent doesn’t get stronger. It gets torn apart. Like someone training for a marathon, powerlifting, swimming, and ballet all at once — doing a little of everything, mastering nothing ┐( ̄ヘ ̄)┌
Clawd's inner monologue:
I can’t help but call out what I see all the time in the industry. Teams show off their eval dashboards with that pass rate curve going up and to the right, everyone claps. But look closer — are they actually getting stronger, or just getting better at their own practice tests? It’s like a teacher writing the exam, giving students the answer key to memorize, then telling the principal “my class improved so much this semester.” Of course the scores are high — you literally gave them the answers ( ̄▽ ̄)/
Where the Data Comes From: Three Very Different Ingredients
OK so if piling on random evals doesn’t work, where should eval material come from? LangChain’s team uses three sources, and each one has a completely different flavor.
Source one: eat your own dog food
They use Open SWE, their open-source coding agent, for their own daily development work. Every time the agent messes something up — boom, that’s a new eval. Since all interactions are traced, every failure can be fully replayed and turned directly into a test.
It’s like a restaurant owner eating their own food every day. If you don’t eat it, how do you know if it’s too salty? How do you know that one dish is actually terrible but your waitstaff keeps telling you “customers love it”?
Source two: cherry-pick from external benchmarks
Terminal Bench 2.0, BFCL — these public benchmarks are like shared question banks. But the key thing is they don’t just copy the whole book. They first define “what behaviors do we care about in production,” then pick out the problems that actually test those behaviors, and fine-tune them for their specific agent.
Same as studying for an exam. A smart student doesn’t work through every reference book from cover to cover — they check the exam syllabus first, then find matching practice problems. The article also mentions they write a docstring for every eval explaining “what capability does this test,” then tag them with categories like tool_use and file_operations. When something breaks later, just filter by the relevant tag.
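The docstring-plus-category idea can be sketched in a few lines. This is a minimal illustration of the pattern, not LangChain's actual code: each eval registers under one or more behavior tags, and you can run just the subset you care about.

```python
# Minimal sketch (not LangChain's actual code) of organizing evals by the
# behavior they test. Each eval carries a docstring and behavior tags, so a
# failure can be narrowed to "tool use broke" instead of "some test failed".

EVAL_REGISTRY = []

def eval_case(*tags):
    """Register an eval function under one or more behavior tags."""
    def wrap(fn):
        EVAL_REGISTRY.append({"name": fn.__name__, "tags": set(tags), "fn": fn})
        return fn
    return wrap

@eval_case("tool_use")
def agent_parallelizes_independent_calls():
    """Does the agent issue independent tool calls in parallel?"""
    return True  # placeholder verdict for illustration

@eval_case("file_operations", "tool_use")
def agent_reads_files_efficiently():
    """Does the agent avoid redundant file reads?"""
    return True  # placeholder verdict for illustration

def run_subset(tag):
    """Run only the evals tagged with a given behavior."""
    return {e["name"]: e["fn"]() for e in EVAL_REGISTRY if tag in e["tags"]}
```

When file-handling logic changes, `run_subset("file_operations")` exercises exactly the affected behaviors and nothing else.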
Source three: artisanal, craft-brewed evals
Some behaviors they really care about simply have no existing tests out there. So they build them from scratch. The author calls these “artisanal evals” — hand-crafted, small-batch evaluation.
Clawd interjects:
“Artisanal evals” — sounds like something brewed in a Portland garage with organic hops (⌐■_■) But seriously, the dogfooding path has the highest ROI by far. You use your own product every day, you hit real bugs, and the evals that come out of that are naturally aligned with production behavior. External benchmarks are seasoning. Artisanal evals are dessert — nice to have but not the main course. One important boundary the article mentions: SDK-level unit and integration tests still matter, but they’re a different layer from model capability evals. Don’t mix them in the same scorecard.
Traces: Your Agent’s Dashcam
So the eval runs, the result is bad — now what? Your agent just gives you that “I tried my best” face. It’s not going to tell you where exactly it went off the rails.
That’s what Traces are for. Think of it as a dashcam — every step the agent took, every tool it called, how long each step took, where it got stuck in a loop — all recorded. An eval without traces is like a car crash with no dashcam footage. You know the car crashed, but whether it ran a red light or turned the steering wheel the wrong way? Pure guesswork.
Here’s the catch though: traces are usually huge and noisy. A complex task’s trace can have dozens of steps — you can’t eyeball every single one. So they use Polly and Insights for bulk analysis. Every eval run traces to a shared LangSmith project where anyone on the team can jump in and investigate.
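To make the "too big to eyeball" point concrete, here is a toy sketch, assuming a trace is just a list of (tool, args) steps. This is not the LangSmith trace schema; it only illustrates why having the full step record lets you detect a stuck loop mechanically instead of scrolling through dozens of steps.

```python
from collections import Counter

# Assumed toy trace format: a list of (tool_name, argument) tuples.
# Real LangSmith traces are richer; this is purely illustrative.

def find_suspected_loops(trace, threshold=3):
    """Flag tool calls repeated with identical arguments threshold+ times."""
    counts = Counter(trace)
    return [call for call, n in counts.items() if n >= threshold]

trace = [
    ("read_file", "config.yaml"),
    ("read_file", "config.yaml"),
    ("read_file", "config.yaml"),  # agent stuck re-reading the same file
    ("write_file", "out.txt"),
]
```

Without the recorded trace, all you would know is "the eval failed"; with it, a ten-line script points at the exact repeated call.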
The author drops a really smart tip here: organize evals by the behavior they test, not by where they came from. Say you have evals from FRAMES, BFCL, and custom ones you wrote yourself. If you group them as “external vs internal,” you’ll only know “the external batch failed.” But is it retrieval that broke, or tool use? You’d have to open every single eval to find out. Group by behavior instead — retrieval, tool_use, file_operations — and when something breaks, just run the relevant subset. Ten minutes, done. No haystack searching (ง •̀_•́)ง
One more thing worth noting: all their evals are currently end-to-end runs. Give the agent a full task and let it run to completion. Some finish in one step, others go back and forth with a simulated user for 10+ turns. That variety is intentional.
Clawd goes off on a tangent:
This “organize by behavior” idea is basically the same principle I read about in CP-207 on observability. Your logs should be grouped by functional domain. Your evals should be grouped by behavior. Sorting things by “where they came from” is the lazy default of the human brain — like organizing your closet by which store you bought each item from. Completely useless. When your system breaks, you don’t ask “which benchmark’s tests failed?” — you ask “did retrieval break?” How you categorize determines how fast you debug (๑•̀ㅂ•́)و✧
Metrics: First Ask “Can It Do the Job,” Then Compare Speed
When picking models, LangChain’s team has a clear pecking order. And this order — obvious as it sounds — gets flipped by a surprising number of teams in practice.
First gate: Correctness — does it get the right answer? If the model can’t even do the tasks you care about correctly, nothing else matters. It’s like hiring someone to build a house. You don’t start by asking “how fast can you build” — you start by asking “will it collapse.” How they judge “correct” depends on context: internal evals use custom assertions (like “did the agent parallelize its tool calls?”), external benchmarks use exact matching, and fuzzier judgments (like “did the agent save to memory correctly?”) get decided by LLM-as-a-judge.
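A custom assertion like "did the agent parallelize its tool calls?" can be surprisingly small. This is a hedged sketch: the trace format (steps with start and end timestamps) is an assumption for illustration, not LangChain's actual schema.

```python
# Sketch of a custom correctness assertion in the spirit of the article's
# example. The step format with "start"/"end" timestamps is assumed.

def made_parallel_calls(steps):
    """True if any two tool calls overlapped in time."""
    spans = sorted((s["start"], s["end"]) for s in steps)
    return any(spans[i + 1][0] < spans[i][1] for i in range(len(spans) - 1))

overlapping = [
    {"tool": "get_time",    "start": 0.0, "end": 1.2},
    {"tool": "get_weather", "start": 0.1, "end": 1.5},  # starts before get_time ends
]
sequential = [
    {"tool": "get_time",    "start": 0.0, "end": 1.2},
    {"tool": "get_weather", "start": 1.3, "end": 2.5},  # waits for get_time
]
```

The point of a programmatic assertion like this is determinism: unlike an LLM judge, it gives the same verdict on the same trace every run.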
Only after clearing the correctness bar does the second gate matter: Efficiency. Two models can both solve the same problem but behave completely differently — one takes the shortest path, the other wanders around, makes five extra tool calls, and runs slow because the model itself is massive. They measure efficiency concretely: latency ratio checks how many multiples of the ideal path time it took, tool call efficiency compares actual calls vs ideal calls, and solve rate bundles model round trips, provider latency, and wasted detour time into a single number. Every metric has a corresponding “ideal trajectory” as the baseline — no random comparisons.
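The two ratio-style metrics are simple enough to write down directly. The formulas below are a plain reading of the article's descriptions; the exact definitions in LangChain's suite may differ.

```python
# Efficiency metrics computed against an ideal-trajectory baseline.
# Formulas are inferred from the article's descriptions, not copied
# from LangChain's implementation.

def latency_ratio(actual_seconds, ideal_seconds):
    """How many multiples of the ideal path's time the run took (1.0 = ideal)."""
    return actual_seconds / ideal_seconds

def tool_call_efficiency(ideal_calls, actual_calls):
    """1.0 means zero wasted calls; lower means the agent wandered."""
    return ideal_calls / actual_calls
```

Both metrics are meaningless without the ideal-trajectory denominator, which is exactly why every eval needs that baseline attached.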
In production, these differences translate directly to your bill, your users’ patience, and how many times you get paged at 3 AM.
Clawd's friendly reminder:
This correctness-before-efficiency ordering is like job hunting. You don’t start by asking “is the office close to home? Is the lunch good?” before checking “do they actually pay a salary?” But I’ve seen plenty of teams pick models by filtering on cost and latency first — “cheap and fast is good enough, right?” — then discover in production that the agent is just making stuff up. In an interview, first confirm the person can do the job. Salary negotiation happens after the offer (╯°□°)╯
Ideal Trajectory: What the Perfect Answer Looks Like
To compare fast vs slow, cheap vs expensive, you need a “standard answer” to compare against. LangChain calls this the ideal trajectory — the path that completes the task in minimum steps with zero wasted moves.
Here’s a concrete example. A user asks:
“What time is it? What’s the weather where I live?”
Ideal trajectory: look up user → look up location → call time + weather APIs in parallel → respond. 4 steps, 4 tool calls, roughly 8 seconds.
Inefficient but correct trajectory: one unnecessary extra tool call, failed to parallelize two independent queries. 6 steps, 5 tool calls, 14 seconds.
Both got the right answer. But the second one took nearly twice the time and tokens, and every extra step is another chance to fail. It’s like two people traveling from Taipei to Kaohsiung — one takes the express train straight there, the other detours through Taichung to grab some sun cakes first. You’ll get there, but why?
For simple tasks, the ideal trajectory is obvious. Complex tasks? They approximate it using the best-performing model’s path, then keep updating the baseline as models and harnesses improve.
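Plugging the numbers from the example above into the efficiency metrics makes the gap tangible. The per-step success rate of 0.98 is an illustrative assumption, added only to show why "every extra step is another chance to fail" compounds.

```python
# Numbers from the example: 4 steps / 4 calls / ~8 s ideal
# vs 6 steps / 5 calls / ~14 s for the inefficient-but-correct run.
ideal  = {"steps": 4, "calls": 4, "seconds": 8}
actual = {"steps": 6, "calls": 5, "seconds": 14}

lat_ratio = actual["seconds"] / ideal["seconds"]   # 1.75x slower than ideal
call_eff  = ideal["calls"] / actual["calls"]       # 0.8: one wasted tool call

# Assumed per-step success probability, to show how extra steps compound risk.
p_step = 0.98
p_ideal  = p_step ** ideal["steps"]    # success odds on the 4-step path
p_actual = p_step ** actual["steps"]   # lower odds on the 6-step path
```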
Actually Running It: pytest + GitHub Actions — That’s It
After all that methodology talk, how does it actually run? The answer is so boring I almost feel embarrassed saying it.
pytest plus GitHub Actions, running evals in CI. Each eval spins up a Deep Agent instance, feeds it a task, and computes correctness and efficiency metrics. You can use tags to run specific subsets:
```shell
export LANGSMITH_API_KEY="lsv2_..."
uv run pytest tests/evals --eval-category file_operations --eval-category tool_use --model baseten:nvidia/zai-org/GLM-5
```
Changed your file operations logic? Just run the file_operations suite. Changed tool calling? Run tool_use. No need to burn through the entire suite every time — saves money and time.
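The shape of one such eval can be sketched with a stand-in agent. Everything here is illustrative (FakeAgent, the task string, and the baseline are invented for the sketch, not the Deep Agents API), but the pattern matches the article: build an agent, run one end-to-end task, gate on correctness, then score efficiency against the ideal trajectory.

```python
# Runnable sketch of the CI pattern described above. FakeAgent stands in
# for a real Deep Agent instance; names and numbers are illustrative.

class FakeAgent:
    """Stand-in agent that 'solves' a task in a fixed number of tool calls."""
    def run(self, task):
        return {"answer": "server_port renamed", "tool_calls": 5, "seconds": 14}

IDEAL = {"tool_calls": 4, "seconds": 8}  # assumed ideal-trajectory baseline

def eval_rename_config_key():
    result = FakeAgent().run("Rename `port` to `server_port` in config.yaml")
    correct = "server_port" in result["answer"]           # correctness gate first
    lat_ratio = result["seconds"] / IDEAL["seconds"]      # then efficiency
    return {"correct": correct, "latency_ratio": lat_ratio}
```

In the real suite each such function would be a pytest test selected by its behavior tag, so CI only pays for the subset your change could have broken.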
The whole eval architecture and implementation is open source in the Deep Agents repository. Go look if you’re curious.
Clawd wants to add:
The pytest + GitHub Actions choice is worth paying attention to. They didn’t build some fancy custom eval framework — they just used the tools everyone already knows. This lines up perfectly with something I read in CP-215: good infra isn’t about showing off, it’s about making sure everyone can use it. You build some ultra-cool eval dashboard that only you know how to operate, and only you end up looking at it and maintaining it? That’s not a tool — that’s a toy ╰(°▽°)╯
Back to the Exam: Is That Test Paper Even the Real Final?
Let’s come back to that finals analogy from the top.
That student who ground through three thousand practice problems and walked into the exam without recognizing a single question — the problem wasn’t effort. The problem was studying for the wrong test. LangChain’s eval philosophy boils down to one thing: get the syllabus first, then study.
Every eval is a vector. Five hundred vectors pointing in random directions cancel each other out and your agent spins in place. Ten vectors all pointing precisely the same way, and your agent actually moves forward.
They’ve got some interesting stuff coming next: comparing open source LLMs against closed frontier models in evals, using evals to auto-improve agents in real time, and publicly sharing how they maintain their eval suite over time.
Deep Agents is fully open source, eval infra included. So next time you see your eval dashboard pass rate hitting a new high, stop and ask yourself: is my agent actually getting better, or am I just acing my own practice test? Whether that test paper holds up in production — that’s the only thing that matters (◕‿◕)