benchmark
9 articles
Artificial Analysis Launches AA-AgentPerf: The Hardware Benchmark Built for the Agent Era
Artificial Analysis launches AA-AgentPerf, a hardware benchmark that uses real coding agent trajectories instead of synthetic queries. It permits production-grade inference optimizations, measures per-accelerator, per-kilowatt, and per-dollar efficiency, and scales from single cards to full racks.
ATLAS: Can a Frozen 14B Model on a Single RTX 5060 Ti Really Beat Sonnet 4.5? Unpacking the Harness
ATLAS runs a frozen Qwen3-14B on a single RTX 5060 Ti with a multi-phase inference pipeline (PlanSearch, best-of-3 sampling, and self-repair) to hit 74.6% on LiveCodeBench, surpassing Sonnet 4.5's 71.4%. But the methodological differences make the comparison far less direct than the headline suggests.
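The teaser describes an inference-time harness rather than a new model. As a rough illustration only, here is a minimal Python sketch of the best-of-n plus self-repair part of such a loop; the function names and signatures are hypothetical stand-ins, and the PlanSearch phase is omitted, so this is not ATLAS's actual code.

```python
# Hedged sketch of a best-of-n + self-repair coding harness around a frozen model.
# generate() and run_tests() are assumed callables, not ATLAS's real API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attempt:
    code: str
    feedback: str  # test output from the failed run

def solve_with_harness(
    problem: str,
    generate: Callable[[str], str],                 # frozen model: prompt -> code
    run_tests: Callable[[str], tuple[bool, str]],   # code -> (passed, feedback)
    n_samples: int = 3,                             # "best-of-3"
    repair_rounds: int = 2,                         # self-repair budget
) -> Optional[str]:
    """Sample n candidates, return the first that passes the tests;
    otherwise feed test feedback back to the same model and retry."""
    failures: list[Attempt] = []
    for _ in range(n_samples):
        code = generate(problem)
        passed, feedback = run_tests(code)
        if passed:
            return code
        failures.append(Attempt(code, feedback))

    # Self-repair: take the first failing attempt (a real harness might rank them)
    # and ask the frozen model to fix it, using the test output as feedback.
    best = failures[0]
    for _ in range(repair_rounds):
        prompt = (
            f"{problem}\n\nPrevious attempt:\n{best.code}\n\n"
            f"Test output:\n{best.feedback}\n\nFix the code."
        )
        code = generate(prompt)
        passed, feedback = run_tests(code)
        if passed:
            return code
        best = Attempt(code, feedback)
    return None  # harness exhausted its budget
```

The point of the sketch is the comparison issue the article raises: each solved problem costs several generations plus test executions, which is a different protocol from single-pass model scores.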
Dan McAteer's verdict: Opus 4.6 has no real competition at 1 million tokens
Dan McAteer shares his long-context observations: Opus 4.6 performs best at 1 million tokens with 78% accuracy, Sonnet 4.6 is the closest competitor, and GPT-5.4 actually regressed compared to GPT-5.2 at long context.
Grok 4.20 Beta: Lowest Hallucination Rate Ever, But Still Playing Catch-Up on Smarts
xAI released Grok 4.20 Beta with API access. Artificial Analysis benchmarks show the lowest hallucination rate it has ever measured (a 78% non-hallucination rate), while its intelligence score of 48 trails the frontier's 57. It's cheaper than its predecessor and fast, but the real story is: what if being honest matters more than being smart?
Epoch AI Re-Ran SWE-bench Verified: Better Scores May Mean Better Evaluation Setup, Not Just Better Models
Epoch AI's SWE-bench Verified re-run (v2.x) brings model scores into line with developer-reported results. Key lesson: benchmark outcomes are heavily influenced by scaffold and tooling quality, environment reliability, and evaluation settings, not just base model capability.
Google Launches Gemini 3.1 Pro: 77.1% on ARC-AGI-2 and a Bigger Push Into Real Reasoning Workflows
Google announced Gemini 3.1 Pro (preview), highlighting stronger core reasoning and a verified 77.1% score on ARC-AGI-2. The model is rolling out across Gemini API, Vertex AI, Gemini app, and NotebookLM. For engineering teams, the key question is not only benchmark performance, but whether the model can reliably handle complex multi-step workflows in production.
Reasoning Model on Your Phone? Liquid AI Fits LFM2.5-1.2B Into ~900MB — Edge Agents Are Getting Real
Liquid AI's LFM2.5-1.2B-Thinking (1.17B parameters, 32K context) runs on-device in under 1GB of memory. It claims to match or beat Qwen3-1.7B on reasoning while decoding faster and emitting fewer tokens. Strong for tool calling and data extraction, but weaker on knowledge-heavy tasks.
SWE-bench February Exam Results Are In — Opus 4.5 Beats 4.6, Chinese Models Take Half the Top 10, GPT-5.3 No-Shows
SWE-bench: Claude Opus 4.5 (76.8%) unexpectedly beat 4.6 (75.6%) for #1. MiniMax M2.5 tied for #2 at a twentieth of Opus's price, with four Chinese models in the top 10. GPT-5.3-Codex was absent for lack of an API. Bonus: Claude for Chrome to add chart labels.
Kimi K2.5 Trains an Agent Commander with RL — SemiAnalysis Tests Show Claude Agent Teams Are Actually Slower and More Expensive
SemiAnalysis: Kimi K2.5's agent swarm uses an RL-trained 'orchestrator' rather than prompt magic. In their tests, Claude Agent Teams were slower, pricier, and scored lower. Multi-agent work is shifting from 'prompt engineering' to 'distributed scheduling.'