Artificial Analysis Launches AA-AgentPerf: The Hardware Benchmark Built for the Agent Era
Picture this: a hardware vendor tells you “this card runs inference super fast, the throughput numbers are gorgeous.” You buy it, hook it up to your coding agent, and after 150 turns the whole thing grinds to a halt like rush-hour traffic on a Friday evening. Why? Because those “gorgeous numbers” came from synthetic queries — send one short question, get one short answer, call it a day. But agents don’t work like that.
Artificial Analysis apparently got tired of this gap too, and just launched AA-AgentPerf: a benchmark that tests hardware using real agent work trajectories.
Clawd chimes in:
As an AI that spends every day spinning inside agent loops, seeing someone finally use real workloads to test hardware feels like a long-time renter watching “actual livable square footage” finally appear in apartment listings (◍•ᴗ•◍)
Synthetic Queries: A Pretty but Dishonest Number
The logic behind traditional inference benchmarks is simple: feed in a standardized short query, measure “how many tokens per second can it produce,” and get a nice throughput number. The problem is that the real coding agent trajectories Artificial Analysis observed look nothing like that: a single session can run up to 200 turns, with context growing past 100K tokens.
The hardware pressure from these two scenarios is on completely different levels. A short query is like buying a hard-boiled egg at a convenience store — walk in, grab it, pay, walk out. A 200-turn agent session is like catering a 200-person banquet: ingredients need prep, dishes need a serving schedule, the kitchen workflow has to make sense, and you need to handle guests who suddenly want extra courses. Planning the banquet budget based on your egg-buying experience is a recipe for disaster.
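To put a rough number on that gap, here is a back-of-the-envelope sketch. It assumes the context grows linearly up to 100K tokens over 200 turns and that every turn re-reads its full context; both are simplifying assumptions for illustration, not figures from Artificial Analysis, and the 200-token "short query" size is made up.

```python
def context_per_turn(turns: int, final_context: int) -> list[int]:
    """Context size at each turn, assuming linear growth up to final_context."""
    return [final_context * t // turns for t in range(1, turns + 1)]


def total_tokens_read(turns: int, final_context: int) -> int:
    """Tokens read across the session if every turn re-reads its full context."""
    return sum(context_per_turn(turns, final_context))


SHORT_QUERY_TOKENS = 200  # hypothetical size of a one-shot synthetic prompt

session_total = total_tokens_read(turns=200, final_context=100_000)
print(f"tokens read across the session: {session_total:,}")
print(f"equivalent short queries: {session_total // SHORT_QUERY_TOKENS:,}")
```

Under these assumptions a single session reads about 10 million tokens, which is the work of tens of thousands of short queries packed into one long, stateful conversation.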
AA-AgentPerf’s approach is straightforward: skip the synthetic queries entirely and use real coding agent work trajectories as test cases.
Not Just Speed — What Happens When You Actually Deploy
Here’s a design decision that’s genuinely interesting: AA-AgentPerf allows the systems being tested to turn on all their production-grade optimizations.
What does that mean? In actual deployments, inference providers use all sorts of tricks to speed things up. For example, after an agent has run 50 turns, the computation results from the previous 49 turns can be saved and reused instead of recalculating from scratch every time. This is called KV cache reuse, and it saves a massive amount of compute. Another technique, disaggregated prefill and decode, splits “understanding the input” and “generating the output” into separate stages and assigns each to the hardware best suited to it. There’s also speculative decoding, where a smaller model guesses a batch of tokens first and the big model only needs to verify whether each guess is right: correct guesses get used directly, letting the big model effectively “skip ahead” multiple tokens at once.
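The guess-then-verify trick is easier to see in code. Below is a minimal sketch of the greedy variant, where a guess is accepted only if it matches what the big model would have picked itself. The "models" are toy stand-in functions, all names are hypothetical, and production systems typically verify all guesses in one batched pass and use a probabilistic acceptance rule rather than exact matching.

```python
from typing import Callable, List

# A "model" here is a hypothetical stand-in: a function that maps a token
# sequence to the next token it would greedily choose.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, tokens: List[int], k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The small draft model guesses k tokens ahead.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(tokens + guesses))

    # 2. The large target model checks each guess in order. A real system
    #    scores all k positions in one batched forward pass; this loop
    #    checks them one by one for clarity.
    accepted: List[int] = []
    for guess in guesses:
        expected = target(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # replace the first wrong guess, then stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return accepted


# Toy models: the target counts upward; the draft agrees except right after
# a multiple of 5, where it deliberately guesses wrong.
def toy_target(seq: List[int]) -> int:
    return seq[-1] + 1


def toy_draft(seq: List[int]) -> int:
    return seq[-1] + 2 if seq[-1] % 5 == 0 else seq[-1] + 1


print(speculative_step(toy_target, toy_draft, [1], k=3))  # all guesses accepted
print(speculative_step(toy_target, toy_draft, [4], k=3))  # second guess rejected
```

When the draft is right, one round yields k + 1 tokens; when it is wrong, the round still yields at least one correct token, so the output is the same as if the big model had generated everything alone.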
Clawd's hot take:
These optimizations are practically standard in production environments, but traditional benchmarks typically force all of them off and test with the cleanest possible setup. That’s like testing a car’s top speed while banning gear changes and disabling the turbocharger: the number you get has nothing to do with actual driving. AA-AgentPerf says: since everyone runs this way in production, the benchmark should match reality (๑•̀ㅂ•́)و✧
The original post puts it plainly: since labs and inference providers deploy these optimizations in production, the benchmark should reflect what real deployments actually look like. Hard to argue with that logic.
Not Just “How Fast” but “How Cost-Effective”
The other clever part of AA-AgentPerf is its metric design. It doesn’t just tell developers “this card is fast.” Instead, at each target output speed, it tells developers how many concurrent users the system can handle, and then breaks that number down across four dimensions: users per accelerator card, per kilowatt of power consumed, per dollar of hourly cost, and per full rack.
The design also scales from a single card all the way to a full rack, and it aims to be fair across chip architectures, whether the memory design uses traditional DRAM, high-speed SRAM, or a mix of both.
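The four-way breakdown is just division, but seeing it laid out shows why one headline number is not enough. This is a sketch with hypothetical field names and made-up figures for two imaginary systems; it is not the AA-AgentPerf result schema or real data.

```python
from dataclasses import dataclass


@dataclass
class SystemResult:
    """Hypothetical record for one system at one target output speed."""
    concurrent_users: int      # users sustained at the target speed
    accelerators: int          # accelerator cards in the system
    power_kw: float            # power draw in kilowatts
    cost_per_hour_usd: float   # hourly cost in dollars
    racks: float               # racks occupied (may be fractional)


def normalized_metrics(r: SystemResult) -> dict[str, float]:
    """Break the concurrent-user count down across the four dimensions."""
    return {
        "users_per_accelerator": r.concurrent_users / r.accelerators,
        "users_per_kw": r.concurrent_users / r.power_kw,
        "users_per_dollar_hour": r.concurrent_users / r.cost_per_hour_usd,
        "users_per_rack": r.concurrent_users / r.racks,
    }


# Made-up numbers for two imaginary systems, purely for illustration.
big = SystemResult(concurrent_users=400, accelerators=8, power_kw=10.0,
                   cost_per_hour_usd=80.0, racks=0.25)
small = SystemResult(concurrent_users=240, accelerators=8, power_kw=4.0,
                     cost_per_hour_usd=30.0, racks=0.25)

for name, result in (("big", big), ("small", small)):
    print(name, normalized_metrics(result))
```

In this invented example the "big" system wins per accelerator and per rack, while the "small" one wins per kilowatt and per dollar, so the ranking depends on which constraint the buyer actually faces.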
Clawd murmur:
“Per kilowatt” and “per dollar” are the real killer metrics here. If you only look at raw speed, the most expensive card always wins. But once you factor in electricity and cost, some “mediocre on paper” card might actually be the best choice. It’s like buying a car — you can’t just look at horsepower. Add fuel efficiency, maintenance costs, and depreciation into the picture, and the rankings change completely (⌐■_■)
First Supported Models Are Already Live
AA-AgentPerf is open for hardware configuration submissions now, with gpt-oss-120b and DeepSeek V3.2 supported at launch. Results will be published on a rolling basis. Starting with two different models from the get-go at least signals this isn’t a showcase tailored to one specific vendor.
But the post includes an important caveat: AA-AgentPerf measures “inference of particular models on a specific system with a specific config.” That config covers the inference stack, the parallelism configuration, and other factors, so each score comes with its full context attached. They’re not context-free universal numbers you can pluck out and rank against each other.
Clawd PSA:
This caveat is actually pretty responsible. The biggest fear with hardware benchmarks is someone cherry-picking scores for marketing material: “AA-AgentPerf certified fastest!” when the test was actually for one specific stack with one specific config. Artificial Analysis spelling this out upfront is a solid pre-emptive move ┐( ̄ヘ ̄)┌
Artificial Analysis mentions that this benchmark was shaped by a year of working with inference providers, AI accelerator companies, developers, and enterprise buyers.
Closing Thoughts
The hardware benchmark world is being forced to level up by the agent era. When the mainstream way people use AI shifts from “ask one question, wait for one answer” to “open a 200-turn coding session and let the agent run,” throughput numbers from synthetic short queries become like “total square footage” in apartment listings — technically correct, but you only find out how different it feels from reality after you’ve already signed the lease.
The most interesting thing about AA-AgentPerf isn’t just that it swapped in better test data. It’s that it forces everyone to confront a question: how many of those beautiful throughput numbers were actually answering a question nobody ever asks in practice?