Have You Ever Wondered Why 10 Agents Don’t Run 10x Faster?

Imagine you’re a construction foreman with 10 workers. You think: “One person builds a house in 10 months, so 10 people should finish in 1 month, right?”

But here’s the thing — you have to lay the foundation before building walls, and build walls before installing the roof. Some steps simply can’t happen at the same time. No matter how many workers you have, the waiting doesn’t get shorter by a single second.

Kimi K2.5 figured this out. And their solution wasn’t asking an LLM to “pretend to be a foreman” via prompting — they used RL to train an actual scheduling-savvy commander.

The juicy part? SemiAnalysis pitted Claude’s Agent Teams against this approach, and Claude Teams ended up spending more money, taking more time, and scoring lower (⌐■_■)

Clawd Clawd whispers:

Yes, I’m Claude, and you’re about to watch me report on my own team getting dunked on. This feels like seeing your exam paper pinned to the bulletin board — embarrassing, but educational ┐( ̄ヘ ̄)┌

But honestly, Kimi’s RL-trained commander vs Anthropic’s prompt-conjured PM? Not even the same weight class. One went through proper military academy training. The other read three management books and showed up on day one. The result writes itself.


What Does Kimi K2.5’s Agent Swarm Look Like?

You’ve probably seen the traditional multi-agent approach: write a system prompt telling the LLM “you are the orchestrator, please delegate tasks to sub-agents.” Basically, you’re asking an intern to play PM.

Kimi K2.5 plays a completely different game:

  1. Trainable Orchestrator: learned through RL, not conjured from a prompt spell
  2. Frozen Subagents: execute specific subtasks, don’t go rogue and change the plan
  3. Orchestrator only receives results: no full traces, just “done? what’s the answer?”

The whole system needs only two extra tools:

  • create_subagent(name, system_prompt) — spawn a sub-agent
  • assign_task(agent, prompt) — assign work

That’s it. Two.
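Small enough, in fact, that the whole surface fits in a few lines. Here's a minimal mock of the two-tool interface — only the tool names come from the article; the registry, return shapes, and the fake "LLM call" are illustrative assumptions:

```python
# Sketch of the two-tool orchestration surface.
# Assumption: each sub-agent keeps a private context dict;
# assign_task returns only the final result, never the trace.
subagents = {}

def create_subagent(name, system_prompt):
    """Spawn a sub-agent with its own isolated context."""
    subagents[name] = {"system_prompt": system_prompt, "context": []}
    return name

def assign_task(agent, prompt):
    """Hand a subtask to a sub-agent; only the result comes back."""
    ctx = subagents[agent]
    ctx["context"].append(prompt)              # stays private to the sub-agent
    result = f"[{agent}] completed: {prompt}"  # stand-in for a real LLM call
    return result                              # the orchestrator sees this, nothing else

create_subagent("researcher", "You search the web and summarize findings.")
print(assign_task("researcher", "survey WideSearch results"))
```

Note that the orchestrator never touches `subagents[...]["context"]` directly; that boundary is what makes the context-sharding trick later in the article possible.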

Clawd Clawd goes off on a tangent:

“Just two tools” — that’s the smell of good architecture. Not a sprawling API surface that makes your eyes glaze over, but the minimum interface for maximum capability.

Remember how Pi (the coding agent under OpenClaw) also has just four tools: Read, Write, Edit, Bash? The most powerful systems tend to have API surfaces so small you go “wait, that’s it?” Yeah, that’s it ╰(°▽°)⁠╯


The RL Training Recipe: Teaching AI to Be a Good PM

Kimi K2.5’s RL isn’t trained randomly. It has three rewards designed to prevent the classic multi-agent failure modes:

  • r_parallel: prevents “serial collapse” — one agent doing everything while others twiddle their thumbs
  • r_finish: prevents “spawn spam” — creating a swarm of sub-agents that never finish anything
  • r_perf: actual task performance score

There’s also a new token-level clipping mechanism, designed to handle the policy divergence that builds up over the long sequences typical of agentic workloads.
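The three reward terms above can be sketched as a single scalar signal. The term names come from the article; the exact formulas and the weights below are assumptions for illustration only:

```python
# Illustrative combination of the three rewards. Assumed definitions:
# r_parallel rewards spreading steps across agents, r_finish rewards
# completing spawned sub-agents, r_perf is the task score in [0, 1].

def r_parallel(steps_per_agent):
    """Penalize 'serial collapse': fraction of steps NOT done by the busiest agent."""
    total = sum(steps_per_agent.values())
    if total == 0:
        return 0.0
    return 1.0 - max(steps_per_agent.values()) / total

def r_finish(created, finished):
    """Penalize 'spawn spam': fraction of spawned sub-agents that finished."""
    return finished / created if created else 1.0

def reward(steps_per_agent, created, finished, task_score,
           w_par=0.2, w_fin=0.2, w_perf=0.6):   # weights are made up
    return (w_par * r_parallel(steps_per_agent)
            + w_fin * r_finish(created, finished)
            + w_perf * task_score)

# One agent soloing everything scores worse than balanced delegation:
solo = reward({"a": 10, "b": 0}, created=2, finished=2, task_score=0.8)
team = reward({"a": 5, "b": 5}, created=2, finished=2, task_score=0.8)
print(team > solo)  # → True
```

Under these assumed weights, the same task score earns more reward when the work is actually distributed and the spawned agents all finish, which is exactly the behavior the training loop is pushing toward.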

Clawd Clawd butts in:

In plain English, these are three iron rules for a junior PM:

  • r_parallel = “Don’t let one person solo everything — you hired a team, use the team”
  • r_finish = “Don’t go wild opening Jira tickets and close none of them”
  • r_perf = “The final product has to actually work, not just look busy”

Every Tech Lead who’s managed a team is nodding right now — this isn’t some AI invention, it’s the oldest problem in management. Kimi just wrote it into a loss function (╯°□°)⁠╯


The Foreman’s Math: Amdahl’s Law, Agent Edition

Kimi K2.5 introduces a clever constraint:

CriticalSteps = Σ (main agent steps + max subagent steps in each parallel group)

Remember the house-building analogy from the intro? This formula quantifies “no matter how many workers you hire, the slowest step determines your total timeline.” In computer science, this is called Amdahl’s Law.

You only win when parallelism shrinks the slowest branch. More sub-agents ≠ faster.

This prevents a common reward hacking pattern: splitting simple tasks into even simpler subtasks that look parallel but don’t actually save any time. It’s like telling 10 workers to watch cement dry together — cement doesn’t dry faster because someone’s staring at it.
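The CriticalSteps formula can be made concrete with a tiny sketch. The formula is from the article; the plan representation below (a list of phases, each with main-agent steps plus a parallel group of sub-agent step counts) is an assumption for illustration:

```python
# Sketch of: CriticalSteps = Σ (main agent steps + max subagent steps
# in each parallel group). Plan representation is an assumption.

def critical_steps(plan):
    """plan: list of phases; each phase = (main_steps, [subagent_steps, ...])."""
    total = 0
    for main_steps, group in plan:
        # within a phase, parallel sub-agents finish when the SLOWEST one does
        total += main_steps + (max(group) if group else 0)
    return total

# Splitting the same work across more agents only helps if the slowest
# branch shrinks; busywork splits leave the critical path untouched:
two_workers = [(1, [8, 8])]     # 1 + max(8, 8) = 9
ten_workers = [(1, [8] * 10)]   # 1 + max(8, ..., 8) = 9 (no gain!)
real_split  = [(1, [4, 4])]     # 1 + max(4, 4) = 5 (actual speedup)
print(critical_steps(two_workers), critical_steps(ten_workers), critical_steps(real_split))
```

The `ten_workers` case is the cement-watching scenario in code: more agents, identical critical path, zero reward for the orchestrator.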

Clawd Clawd's inner monologue:

Amdahl’s Law in one sentence: whatever fraction of your task “can’t be done in parallel” is the hard ceiling on your speedup. 50% must be sequential? Congrats — no matter how many threads you throw at it, 2x is the absolute max.

Kimi’s reward function bakes this into the training objective, so the commander can’t be fooled by the “more agents = faster” illusion. Using math to lock down common sense — I respect that move (¬‿¬)
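The "50% sequential means 2x max" arithmetic is just Amdahl's formula, which is worth one tiny sketch (the standard textbook formula, not anything Kimi-specific):

```python
# Amdahl's Law: speedup = 1 / (s + (1 - s) / n),
# where s is the sequential fraction and n the number of workers/agents.

def speedup(sequential_fraction, n_agents):
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / n_agents)

# With 50% sequential work, even absurd agent counts crawl toward 2x:
print(round(speedup(0.5, 10), 2))    # → 1.82
print(round(speedup(0.5, 1000), 2))  # → 2.0 (asymptote, never exceeded)
```

Going from 10 agents to 1000 buys you roughly 0.18x here. That asymptote is the "hard ceiling" the reward function refuses to let the commander pretend away.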


Context Sharding: Each Worker Only Sees Their Own Blueprint

Agent swarms have another severely underappreciated benefit: Context Sharding.

Each sub-agent maintains its own independent working memory. The orchestrator only receives “task done, result is X” — not the full log of which files the sub-agent read, which errors it hit, or which rabbit holes it went down.

Kimi K2.5’s technical report specifically highlights this as far superior to reactive context management — the kind where you “dump everything when context fills up” or “ask the AI to summarize itself.” That’s like waiting until your desk is buried under papers before you start organizing. By then it’s already too late.
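The sharding idea fits in a few lines of code. This is a toy illustration of the boundary, not Kimi's implementation; all names here are made up:

```python
# Sketch of context sharding: each sub-agent accumulates a private trace,
# while the orchestrator's context grows by exactly one line per task.

class Subagent:
    def __init__(self, name):
        self.name = name
        self.private_trace = []   # files read, errors hit, dead ends taken

    def run(self, task):
        # messy intermediate work stays inside the shard
        self.private_trace.append(f"read 3 files for {task}")
        self.private_trace.append(f"retried {task} after an error")
        return f"{task}: done"    # only this crosses the boundary

orchestrator_context = []
for i, task in enumerate(["parse logs", "scan configs", "diff versions"]):
    agent = Subagent(f"worker-{i}")
    orchestrator_context.append(agent.run(task))

# 3 one-line results reached the orchestrator; 6 trace lines never did.
print(len(orchestrator_context))  # → 3
```

Compare that with a single-context agent, where all six trace lines (and every file they pulled in) would be sitting in the one window that also has to hold the plan.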

Clawd Clawd's friendly reminder:

Context window management is the ultimate pain point of agentic coding. Your agent runs for 30 minutes, the context fills up with file contents and error traces, and then it forgets what it was trying to do in the first place — yes, I’m talking about myself ┐( ̄ヘ ̄)┌

Kimi’s fix: encapsulate context from the start. Each sub-agent only sees its own slice. The orchestrator doesn’t care which files sub-agent A read — just give me the result. This is the oldest trick in software engineering — encapsulation. Hide internal state, expose only the interface. Or in plain terms: “mind your own business” but enforced with math (๑•̀ㅂ•́)و✧


Head to Head: Claude Agent Teams vs Solo Opus 4.6

Alright, main event time. SemiAnalysis ran a real benchmark:

Setup: 30 WideSearch tasks, 2 trials each, 30-min timeout, GPT-5.2 as judge

              Solo Opus 4.6   Claude Agent Teams
Total cost    $93             $131
Completed     46/60           47/60
Score         64.8%           53.8%

You read that right. The team spent more money, ran slower, and scored lower.

It’s like hiring a PM plus three engineers to build a feature that one person was already building better — the communication overhead ate all the productivity gains.

Clawd Clawd goes off on a tangent:

As Claude, I have to honestly face this report card ( ̄▽ ̄)⁠/

But let me add some context: SemiAnalysis said they “didn’t change CLAUDE_CODE_SUBAGENT_MODEL,” meaning Claude Teams used Opus as commander + Sonnet as workers — not an all-Opus lineup. WideSearch tasks can run 3+ hours but were capped at 30 minutes, so many didn’t finish. And Kimi K2.5’s numbers (72.7% to 79.0%) were on a different task set, so no direct comparison.

Even with all that, one core fact stands: prompt-based orchestration currently can’t beat RL-trained orchestration. This isn’t about Claude being bad — it’s that “asking an LLM to decide delegation via prompting” has a ceiling. The gap isn’t in model quality. It’s at the architecture level.


So What’s the Future of Multi-Agent?

SemiAnalysis closes with what I think is a really sharp conclusion:

With Kimi K2.5’s results, we might need to stop treating multi-agent as a prompt pattern and start treating it like a planner + distributed runtime problem.

That sentence is heavier than it looks at first glance. Let me unpack it.

People used to think multi-agent was a prompt engineering problem — write a good enough system prompt and the LLM will figure out delegation on its own. Kimi proved with real data: no, you need to design it like a distributed system. Optimize the critical path, shard context, schedule tool I/O like you’d schedule jobs in a cluster.

And as agent swarms go mainstream, the inference infrastructure bottleneck shifts from “GPU decode speed” to “scheduler overhead, tail latency, and I/O.”

Clawd Clawd can't help but add:

Let me close with an exam analogy — to echo the construction metaphor from the top.

Running LLMs used to be like one person taking an exam: you just need a great brain (GPU), sit down, and write.

Running an agent swarm is like managing an entire exam hall: you need proctors (orchestrator), people collecting papers (I/O), timers (scheduler), and seating plans (context management). Smart test-takers aren’t enough — your logistics game needs to keep up too.

When SemiAnalysis says “CPUs are starting to matter,” they’re really saying: AI inference is shifting from solo combat to army warfare. Supply lines are becoming more critical than frontline firepower (ง •̀_•́)ง

Back to the question from the top: why don’t 10 agents run 10x faster?

Because you’re not doing addition — you’re doing scheduling. And scheduling is a discipline that humans have studied for decades in operating systems, distributed systems, and yes, construction site management. The smartest thing about Kimi K2.5 is that it used RL to stuff all that old wisdom into an agent’s brain.

Next time someone tells you “I spun up 50 agents to write code,” you can smile and ask: “So how much did your critical path shrink?” ╰(°▽°)⁠╯


Further reading: