benchmarks - Tags

Anthropic Exposes AI Benchmarks' Dirty Secret — Leaderboard Gaps Might Just Mean 'Bigger VM'

CP-39 2026-02-07 · Anthropic Engineering Blog (Gian Segato)

Anthropic found that agentic coding benchmark scores can swing by up to 6 percentage points based on hardware configuration alone, often more than the gap between top models on leaderboards. Next time someone claims a lead of 2-3 percentage points, ask them what VM they ran on.
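To make the arithmetic concrete, here is a minimal Python sketch of that sanity check. The function name `lead_exceeds_infra_swing` and the example scores are hypothetical. The 6-point default treats the reported upper bound as a worst-case noise floor, which is an assumption for illustration and not a method from the Anthropic post.

```python
def lead_exceeds_infra_swing(score_a: float, score_b: float,
                             infra_swing_pp: float = 6.0) -> bool:
    """Return True only if the gap between two benchmark scores
    (in percentage points) is larger than the assumed swing that
    hardware configuration alone could cause."""
    gap_pp = abs(score_a - score_b)
    return gap_pp > infra_swing_pp


# Hypothetical leaderboard entries, in percent of tasks solved.
model_a, model_b = 72.5, 70.0   # a 2.5-point lead
print(lead_exceeds_infra_swing(model_a, model_b))  # lead is inside the swing
print(lead_exceeds_infra_swing(80.0, 70.0))        # lead is outside the swing
```

Under these assumptions, a lead of 2-3 points cannot be told apart from a difference in VM setup unless both runs used the same configuration.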