benchmark
9 articles
Artificial Analysis Launches AA-AgentPerf: The Hardware Benchmark Built for the Agent Era
Artificial Analysis launches AA-AgentPerf, a hardware benchmark that uses real coding agent trajectories instead of synthetic queries. It permits production-grade inference optimizations, measures per-accelerator, per-kilowatt, and per-dollar efficiency, and scales from single cards to full racks.
ATLAS: Can a Frozen 14B Model on a Single RTX 5060 Ti Really Beat Sonnet 4.5? Unpacking the Harness
ATLAS runs a frozen Qwen3-14B on a single RTX 5060 Ti with a multi-phase inference pipeline (PlanSearch, best-of-3 sampling, and self-repair) to hit 74.6% on LiveCodeBench, surpassing Sonnet 4.5's 71.4%. But the methodological differences make the comparison far less direct than the headline suggests.
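The teaser describes an inference-time harness rather than a new model. As a rough illustration only, here is a minimal Python sketch of the best-of-n plus self-repair part of such a loop; the function names and signatures are hypothetical stand-ins, and the PlanSearch phase is omitted, so this is not ATLAS's actual code.

```python
# Hedged sketch of a best-of-n + self-repair coding harness around a frozen model.
# generate() and run_tests() are assumed callables, not ATLAS's real API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attempt:
    code: str
    feedback: str  # test output from the failed run

def solve_with_harness(
    problem: str,
    generate: Callable[[str], str],                 # frozen model: prompt -> code
    run_tests: Callable[[str], tuple[bool, str]],   # code -> (passed, feedback)
    n_samples: int = 3,                             # "best-of-3"
    repair_rounds: int = 2,                         # self-repair budget
) -> Optional[str]:
    """Sample n candidates, return the first that passes the tests;
    otherwise feed test feedback back to the same model and retry."""
    failures: list[Attempt] = []
    for _ in range(n_samples):
        code = generate(problem)
        passed, feedback = run_tests(code)
        if passed:
            return code
        failures.append(Attempt(code, feedback))

    # Self-repair: take the first failing attempt (a real harness might rank them)
    # and ask the frozen model to fix it, using the test output as feedback.
    best = failures[0]
    for _ in range(repair_rounds):
        prompt = (
            f"{problem}\n\nPrevious attempt:\n{best.code}\n\n"
            f"Test output:\n{best.feedback}\n\nFix the code."
        )
        code = generate(prompt)
        passed, feedback = run_tests(code)
        if passed:
            return code
        best = Attempt(code, feedback)
    return None  # harness exhausted its budget
```

The point of the sketch is the comparison issue the article raises: each solved problem costs several generations plus test executions, which is a different protocol from single-pass model scores.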
Dan McAteer's verdict: Opus 4.6 has no real competition at 1 million tokens
Dan McAteer shares his long-context observations: Opus 4.6 performs best at 1 million tokens with 78% accuracy, Sonnet 4.6 is the closest competitor, and GPT-5.4 actually regressed compared to GPT-5.2 at long context.
Grok 4.20 Beta: Lowest Hallucination Rate Ever, But Still Playing Catch-Up on Smarts
xAI released Grok 4.20 Beta with API access. Artificial Analysis benchmarks show the lowest hallucination rate it has ever measured (a 78% non-hallucination rate), while its intelligence score of 48 trails the frontier's 57. It's cheaper than its predecessor and fast, but the real story is: what if being honest matters more than being smart?
Epoch AI Re-Ran SWE-bench Verified: Better Scores May Mean Better Evaluation Setup, Not Just Better Models
Epoch AI's SWE-bench Verified re-run (v2.x) brings model scores into line with developer-reported results. Key lesson: benchmark outcomes are heavily influenced by scaffold and tooling quality, environment reliability, and evaluation settings, not just base model capability.
Google Launches Gemini 3.1 Pro: 77.1% on ARC-AGI-2 and a Bigger Push Into Real Reasoning Workflows
Google announced Gemini 3.1 Pro (preview), highlighting stronger core reasoning and a verified 77.1% score on ARC-AGI-2. The model is rolling out across Gemini API, Vertex AI, Gemini app, and NotebookLM. For engineering teams, the key question is not only benchmark performance, but whether the model can reliably handle complex multi-step workflows in production.
Reasoning Model on Your Phone? Liquid AI Fits LFM2.5-1.2B Into ~900MB — Edge Agents Are Getting Real
Liquid AI's LFM2.5-1.2B-Thinking (1.17B parameters, 32K context) runs on-device in under 1GB of memory. It claims to match or beat Qwen3-1.7B on reasoning while decoding faster and emitting fewer tokens. Strong for tool calling and data extraction, but weaker on knowledge-heavy tasks.
SWE-bench February Exam Results Are In — Opus 4.5 Beats 4.6, Chinese Models Take Half the Top 10, GPT-5.3 No-Shows
SWE-bench: Claude Opus 4.5 (76.8%) unexpectedly beat 4.6 (75.6%) for #1. MiniMax M2.5 tied for #2 at a twentieth of Opus's price, with four Chinese models in the top 10. GPT-5.3-Codex was absent for lack of an API. Bonus: Claude for Chrome to add chart labels.
Kimi K2.5 Trains an Agent Commander with RL — SemiAnalysis Tests Show Claude Agent Teams Are Actually Slower and More Expensive
SemiAnalysis: Kimi K2.5's agent swarm uses an RL-trained 'orchestrator' rather than prompt magic. In their tests, Claude Agent Teams were slower, pricier, and scored lower. Multi-agent work is shifting from 'prompt engineering' to 'distributed scheduling.'