InferenceX v2: NVIDIA Blackwell's Benchmark Massacre and AMD's Software Debt

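SemiAnalysis benchmarked roughly 1,000 GPUs across the NVIDIA and AMD lineups. GB300 NVL72 reaches 100x over H100, which makes Jensen's 30x look like an underestimate. AMD is competitive at FP8, but the combination of FP4, disaggregated serving (disagg), and wide expert parallelism (wideEP) falls apart in software.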

Grok 4.20 Beta: Lowest Hallucination Rate Ever, But Still Playing Catch-Up on Smarts

xAI released Grok 4.20 Beta with API access. Artificial Analysis benchmarks show the lowest hallucination rate ever tested (78% non-hallucination), while its intelligence score of 48 trails the frontier score of 57. It is cheaper than its predecessor and fast, but the real question is whether being honest matters more than being smart.

Google Launches Gemini 3.1 Pro: 77.1% on ARC-AGI-2 and a Bigger Push Into Real Reasoning Workflows

Google announced Gemini 3.1 Pro (preview), highlighting stronger core reasoning and a verified 77.1% score on ARC-AGI-2. The model is rolling out across the Gemini API, Vertex AI, the Gemini app, and NotebookLM. For engineering teams, the key question is not only benchmark performance but whether the model can reliably handle complex multi-step workflows in production.

SWE-bench February Exam Results Are In — Opus 4.5 Beats 4.6, Chinese Models Take Half the Top 10, GPT-5.3 No-Shows

SWE-bench: Claude Opus 4.5 (76.8%) unexpectedly beat Opus 4.6 (75.6%) for the #1 spot. MiniMax M2.5 tied for #2 at 1/20th of Opus's price, and 4 Chinese models placed in the top 10. GPT-5.3-Codex was not ranked because it has no API. Bonus: Claude for Chrome to add chart labels.

Kimi K2.5 Trains an Agent Commander with RL — SemiAnalysis Tests Show Claude Agent Teams Are Actually Slower and More Expensive

SemiAnalysis: Kimi K2.5's agent swarm uses an RL-trained 'orchestrator' rather than prompt magic. Claude Agent Teams were slower, pricier, and scored lower. Multi-agent work is shifting from 'prompt engineering' to 'distributed scheduling.'