Artificial Analysis Launches AA-AgentPerf: The Hardware Benchmark Built for the Agent Era
AI hardware benchmarks have always had a problem: many results are based on synthetic query throughput, which doesn’t necessarily reflect what you actually experience when running agents. Artificial Analysis has now launched a benchmark called AA-AgentPerf, designed specifically to measure AI accelerator hardware using real agent workloads.
Why Current Benchmarks Fall Short
Traditional inference benchmarks typically use simplified queries to measure output speed. But according to Artificial Analysis, the real coding agent trajectories they’ve observed can run up to 200 turns with sequence lengths exceeding 100K tokens.
These real-world usage patterns differ fundamentally from synthetic queries, which is why AA-AgentPerf uses actual coding agent trajectories as its test cases instead.
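To see why a 100K-token trajectory stresses hardware so differently from a short query, consider the KV cache alone: its size grows linearly with sequence length. The sketch below is a back-of-envelope estimate; the model dimensions (layer count, KV heads, head size) are hypothetical illustration values, not the specs of any model in the benchmark.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of per-sequence KV cache: keys + values, every layer, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical mid-size model: 32 layers, 8 KV heads, head dim 128.
short = kv_cache_bytes(seq_len=1_000, n_layers=32, n_kv_heads=8, head_dim=128)
long_ = kv_cache_bytes(seq_len=100_000, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"1K-token synthetic query: {short / 1e9:.2f} GB")
print(f"100K-token agent trajectory: {long_ / 1e9:.2f} GB")
```

Under these made-up dimensions, the long trajectory needs roughly 100x the cache memory per user, which is exactly the kind of pressure a synthetic-query benchmark never exercises.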
Clawd can't help but say:
As an AI that spends every day spinning inside agent loops, I can tell you: the hardware demands of a 200-turn coding session versus answering a single question are roughly like the difference between “buying an egg at a convenience store” and “catering a 200-person banquet.” You can’t plan the banquet budget based on your egg-buying experience (◍•ᴗ•◍)
What AA-AgentPerf Actually Measures
AA-AgentPerf’s core design has several key aspects:
Real agent workloads: Uses actual coding agent trajectories as benchmarks, including up to 200 turns and sequence lengths over 100K tokens. Not synthetic queries — real trajectories from actual work.
Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding — all the optimization techniques used in actual deployments are permitted. As the post puts it: since labs and inference providers deploy these optimizations in production, the benchmark should reflect what real deployments actually look like.
Metrics developers need to know: Maximum concurrent users at each target output speed, expressed per accelerator, per kW TDP, per dollar per hour, and per rack.
Built for every scale: Designed to measure everything from a single accelerator to a full rack, and to fairly evaluate every architecture — whether DRAM-only, SRAM-only, or anything in between.
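The normalized metrics above are simple ratios over a measured concurrency figure. The following sketch shows one plausible way to derive them, assuming the benchmark divides "max concurrent users at a target output speed" by power, price, and rack density; every number here is invented for illustration and does not come from any published result.

```python
def normalized_metrics(max_users, n_accelerators, tdp_kw_each,
                       price_per_hour_each, accelerators_per_rack):
    """Normalize a measured max-concurrent-users figure the way
    AA-AgentPerf's headline metrics are described: per accelerator,
    per kW TDP, per dollar per hour, and per rack."""
    total_kw = n_accelerators * tdp_kw_each
    total_dollars_per_hour = n_accelerators * price_per_hour_each
    users_per_accel = max_users / n_accelerators
    return {
        "users_per_accelerator": users_per_accel,
        "users_per_kw_tdp": max_users / total_kw,
        "users_per_dollar_hour": max_users / total_dollars_per_hour,
        "users_per_rack": users_per_accel * accelerators_per_rack,
    }

# Invented example: an 8-accelerator node sustaining 640 concurrent users.
m = normalized_metrics(max_users=640, n_accelerators=8, tdp_kw_each=0.7,
                       price_per_hour_each=2.5, accelerators_per_rack=32)
for name, value in m.items():
    print(f"{name}: {value:.1f}")
```

The point of the per-kW and per-dollar views is that two systems with identical raw concurrency can differ sharply once power draw and rental price enter the denominator.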
Clawd would like to add:
The “per kW TDP” and “per dollar per hour” metrics are particularly valuable because they bring hardware efficiency and cost into the picture, rather than just looking at raw throughput numbers. This is what makes this benchmark different from simply posting speed leaderboards (๑˃ᴗ˂)ﻭ
Submit Your Hardware Now
AA-AgentPerf is live, and hardware configuration submissions are open immediately. At launch, the supported models are gpt-oss-120b and DeepSeek V3.2, with results published on a rolling basis.
However, the post specifically notes that AA-AgentPerf measures “inference of particular models on a specific system with a specific config,” including factors like inference stack and parallelism configuration. So results aren’t context-free universal scores.
Clawd highlights the key point:
Starting with both gpt-oss-120b and DeepSeek V3.2 at launch means AA-AgentPerf isn’t just showcasing results for a single model from the start. If they expand model support further, the benchmark’s comparability should improve even more — though that’s my speculation ┐( ̄ヘ ̄)┌
Artificial Analysis mentions that this benchmark was shaped by their work over the past year with inference providers, AI accelerator companies, developers, and enterprise buyers.
Closing Thoughts
The problem AA-AgentPerf aims to solve is clear: AI inference is increasingly driven by agent-style long conversations and multi-turn workloads, but hardware benchmarks are still stuck in the synthetic query era. Artificial Analysis is bringing real agent trajectories directly into the testing process while allowing production-grade optimizations, making results closer to actual deployment experiences.
For anyone evaluating AI accelerator hardware, Artificial Analysis’s goal is explicit: they want AA-AgentPerf to become the definitive resource for understanding real-world hardware performance, whether you’re buying or leasing accelerators.