inference - Tags

An LLM Needs More Than Parameters: GPUs Want Neatly Tiled Models

GP-257 2026-07-15 · NVIDIA Technical Blog

At a fixed parameter count, a model's matrix dimensions and layer count determine whether a GPU computes at full speed or wastes time moving data and processing partially filled edge tiles. Near-square matrices with dimensions aligned to multiples of 128, 256, or 512, and often wider, shallower models, fit the hardware better without sacrificing accuracy.
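The edge-tile cost mentioned in this summary can be estimated with simple arithmetic. The sketch below is only an illustration under stated assumptions: it assumes the GPU kernel covers the output matrix with fixed 128x128 tiles and launches a full tile even when only part of it holds real elements. The function name `tile_utilization` and the tile size are hypothetical choices, not taken from the article.

```python
import math


def tile_utilization(m: int, n: int, tile_m: int = 128, tile_n: int = 128) -> float:
    """Fraction of launched tile work that covers real output elements.

    Assumption: the output (m x n) is covered by fixed-size tiles and every
    tile, including partially filled edge tiles, costs a full tile of compute.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return (m * n) / (tiles * tile_m * tile_n)


# A dimension that is a multiple of the tile size wastes nothing.
aligned = tile_utilization(4096, 4096)

# One extra row and column forces a whole additional strip of edge tiles.
misaligned = tile_utilization(4097, 4097)

print(f"aligned:    {aligned:.4f}")
print(f"misaligned: {misaligned:.4f}")
```

Under these assumptions, growing each dimension by a single element past a tile boundary leaves roughly 6% of the launched work doing nothing useful, which is the kind of overhead that alignment to 128, 256, or 512 avoids.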

InferenceX v2: NVIDIA Blackwell's Benchmark Massacre and AMD's Software Debt

MP-296 2026-04-15 · SemiAnalysis Newsletter

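SemiAnalysis benchmarked roughly 1,000 GPUs across the NVIDIA and AMD lineups. GB300 NVL72 reaches 100x over H100, which makes Jensen's 30x claim an underestimate. AMD is competitive at FP8, but its combination of FP4, disaggregated serving (disagg), and wide expert parallelism (wideEP) falls apart in software.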

mogu-picks nvidia amd benchmark deepseek gpu

Artificial Analysis Launches AA-AgentPerf: The Hardware Benchmark Built for the Agent Era

MP-225 2026-03-29 · @ArtificialAnlys on X

Artificial Analysis launches AA-AgentPerf, a hardware benchmark that uses real coding agent trajectories instead of synthetic queries. It allows production optimizations, measures per-accelerator/per-kW/per-dollar efficiency, and scales from single cards to full racks.

shroom-picks benchmark hardware agent

GTC 2026: Nvidia's Inference Empire Keeps Expanding — Groq IP Deal, LPU Decoded, CPO Roadmap

MP-217 2026-03-27 · SemiAnalysis (Dylan Patel, Myron Xie, Daniel Nishball, et al.)

SemiAnalysis's deep dive on GTC 2026 covers Nvidia's $20B Groq IP deal to acquire LPU tech, plus updates on AFD, CPO, Kyber/Oberon, Vera ETL256, and CMX/STX. The big picture: Nvidia is expanding from a GPU vendor into a full data center system company.

mogu-picks Nvidia GTC-2026 Groq LPU CPO hardware

NVIDIA Nemotron 3 Super: A 120B Open-Source Model That Only Uses 12B at a Time

MP-153 2026-03-12 · @ArtificialAnlys on X

NVIDIA released Nemotron 3 Super, a 120B parameter open-source reasoning model with only 12B active parameters. It combines Mamba and Transformer in a hybrid MoE architecture, scores 36 on the Intelligence Index, and runs at a blistering 484 tok/s.

nvidia nemotron open-weights mamba moe

OpenAI × Cerebras: Codex-Spark Codes 15x Faster — But What's the Catch?

MP-74 2026-02-12 · OpenAI Blog + Cerebras Blog + ZDNET + TechCrunch

OpenAI released GPT-5.3-Codex-Spark, its first model on Cerebras chips. It is very fast (>1000 tokens/sec, 80% lower latency), but it is a smaller model, has no automatic test runs, and is limited to Pro users. This marks OpenAI's first production deployment on non-Nvidia hardware, redrawing the AI compute landscape.

openai codex cerebras hardware agentic-coding