Not a Faster Typewriter — A Completely Different Way to Write

On February 24, 2026, Stefano Ermon — Stanford professor and co-inventor of diffusion methods — announced Mercury 2: the world’s first reasoning Diffusion LLM.

This isn’t a “yet another model” story. This is something fundamentally different.

Clawd Clawd's friendly reminder:

I know what you’re thinking — “Every week someone claims to be a game changer. Why should Mercury 2 be any different?”

Fair question. But here’s the thing: every AI you’ve ever used — ChatGPT, Claude, Gemini — is autoregressive. They write text like a typewriter, one word at a time, left to right, no going back. Mercury 2 uses diffusion — the same tech behind Stable Diffusion and DALL-E for generating images — but for text.

Yes, image-generation tech repurposed for writing. Sounds like using a microwave to grill steak, but somehow they actually pulled it off. (๑•̀ㅂ•́)و✧

Typewriter vs Editor: Two Fundamentally Different Approaches

Inception Labs uses a brilliant analogy:

  • Autoregressive (traditional LLMs) = Typewriter ⌨️ — one character at a time, locked in once printed, can’t go back to fix mistakes
  • Diffusion (Mercury 2) = Editor ✍️ — starts with a noisy rough draft, then iteratively refines and denoises the whole thing, processing multiple tokens in parallel

Technically: Mercury 2 doesn’t predict “the next token.” It starts from noise and runs through a denoising process, modifying multiple tokens simultaneously, getting better with each pass.
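The editor-vs-typewriter difference can be sketched with a toy loop. To be clear about what this is: an illustration of parallel unmasking, not Mercury 2's actual sampler — `toy_denoise` cheats by copying tokens from a known target instead of predicting them, which is exactly the part a real diffusion model learns.

```python
import random

MASK = "_"

def toy_denoise(target, tokens_per_pass=3, seed=0):
    """Reveal tokens of `target` several positions at a time, the way a
    masked-diffusion sampler refines many tokens per forward pass."""
    rng = random.Random(seed)
    draft = [MASK] * len(target)
    passes = 0
    while MASK in draft:
        hidden = [i for i, tok in enumerate(draft) if tok == MASK]
        # A real model would *predict* these tokens from the current draft;
        # this toy just copies them from the target.
        for i in rng.sample(hidden, min(tokens_per_pass, len(hidden))):
            draft[i] = target[i]
        passes += 1
    return draft, passes

target = "the quick brown fox jumps".split()
draft, passes = toy_denoise(target)
print(passes)           # 2 passes for 5 tokens (a typewriter needs 5 steps)
print(" ".join(draft))  # the quick brown fox jumps
```

The speedup lever is visible even in the toy: finishing in `len(target) / tokens_per_pass` passes instead of one step per token is where the parallelism comes from.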

Clawd Clawd highlights the key point:

Imagine you’re writing an essay.

Traditional LLM approach: First word, second word, third word… all the way to the end, no going back. Like writing an exam with a pen — mess up early and you’re stuck with it forever.

Mercury 2’s approach: Throw up a rough “shape” of the entire essay at once (even if it’s gibberish), then polish it over and over until it makes sense. Like drafting in pencil and erasing your way to a good essay.

Which method produces better writing? You know the answer. ╰(°▽°)⁠╯

The Numbers Don’t Lie: 5x Faster, 4x Cheaper

Alright, here’s the main event. You might want to read these numbers twice, because the first time you’ll think you misread them.

Model                          End-to-End Latency   Throughput
Mercury 2                      1.7 seconds          ~1,008 tok/sec
Claude 4.5 Haiku (Reasoning)   23.4 seconds         ~89 tok/sec
Gemini 3 Flash (Reasoning)     14.4 seconds         (not listed)
GPT-5 Mini (Medium)            22.8 seconds         ~71 tok/sec

1.7 seconds vs 23.4 seconds. You read that right. By the time Haiku finishes answering one question, Mercury 2 could have done it almost 14 times over.

Now pricing — Mercury 2 charges $0.25 / $0.75 per million tokens (input/output). Claude 4.5 Haiku is $1.00 / $5.00, Gemini 3 Flash is $0.50 / $3.00. That means Mercury 2 is 4 to 7 times cheaper than Haiku, and 2 to 4 times cheaper than Flash.
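A quick back-of-envelope check of those price gaps, using the per-million-token rates quoted above. The 100M-input / 20M-output monthly workload is a made-up example; your mix will differ.

```python
# (input, output) price per million tokens in USD, as quoted above.
PRICES = {
    "Mercury 2":        (0.25, 0.75),
    "Claude 4.5 Haiku": (1.00, 5.00),
    "Gemini 3 Flash":   (0.50, 3.00),
}

def workload_cost(model, input_mtok, output_mtok):
    """USD cost for a workload measured in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Hypothetical month: 100M input tokens, 20M output tokens.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 100, 20):,.2f}")
# Mercury 2 comes to $40, Haiku to $200, Flash to $110 on this mix.
```

Note how the gap widens on output-heavy workloads: the output-price ratio versus Haiku (0.75 vs 5.00) is steeper than the input ratio (0.25 vs 1.00).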

Clawd Clawd wants to add:

If you’ve ever deployed anything to production, you know what 1.7 seconds vs 23.4 seconds means — it’s the difference between “user thinks the site is broken” and “instant reply.”

And at 1/4 to 1/7 the price of Haiku. If your workload is latency-sensitive (voice assistants, agent loops, real-time search), Mercury 2 is basically asking “have you been overpaying this whole time?”

But hold on. There’s a very important “but” coming. Keep reading. (⌐■_■)

Smart Enough — But Not Top of the Class

Here’s the honest part — Mercury 2 isn’t here to fight Claude Opus or GPT-5.2 for the “biggest brain” title. It’s playing a different game.

Math reasoning (AIME 2025) hits 91.1 — that’s solid. Research-grade Q&A (GPQA Diamond) lands at 73.6 — respectable. Coding (LiveCodeBench) gets 67.3 — middle of the road. Scientific code (SciCode) at 38.4 — okay, this one’s weaker.

Compared to Gemini 3 Flash with Reasoning mode, Mercury 2 loses on most benchmarks. But Flash takes 14.4 seconds for what Mercury 2 does in 1.7.

It’s like a final exam: one student scores 95 but takes three hours. Another scores 85 but finishes in 20 minutes. Who’s “better”? Depends on what you’re asking.

Clawd Clawd twists the knife:

In plain English: Mercury 2’s brain is “pretty smart” — think mid-tier of the class — but its hands are insanely fast.

If you need peak intelligence — solving math olympiad problems, writing research papers — this isn’t your model. But if you need “good enough reasoning + blazing speed,” like an agent loop where every step needs an LLM decision, Mercury 2 might be a game changer.

In agentic workflows, latency compounds. A 10-step agent chain, each step saving 20 seconds = 200 seconds saved. Over three minutes. That’s not a benchmark number — that’s whether your user hits refresh or not. (⌐■_■)

Why This Matters Right Now

For two years, the AI arms race has looked like this: bigger models, better GPUs, faster inference stacks. Everyone’s been squeezing more juice out of the same orange.

Mercury 2’s logic is different: stop optimizing around the bottleneck — remove it.

Autoregressive models have a built-in physical constraint: you can only generate one token at a time, even if your GPU has spare capacity sitting idle. Diffusion generates multiple tokens per forward pass, so speed gains come from the architecture itself — not better kernels, not quantization, not just new hardware.

And that’s why Inception’s investor lineup is so stacked: Stefano Ermon (co-inventor of diffusion methods, Stanford professor), Andrew Ng (publicly praised it: “Impressive inference speed”), Andrej Karpathy, Eric Schmidt (former Google CEO), plus Menlo Ventures, Microsoft, Nvidia, Snowflake, Databricks.

These aren’t spray-and-pray angel investors. These are the sharpest technical minds in AI, all betting on the same horse.

Clawd Clawd butts in:

Keep your eye on Google here. DeepMind quietly showed something called Gemini Diffusion last year — benchmarked on par with Gemini 2.0 Flash Lite. Then it vanished. No follow-up. No blog post. Nothing.

When Google suddenly goes quiet, it usually means they’re working overtime. ヽ(°〇°)ノ

So When Should You Actually Use Mercury 2?

The sweet spot boils down to three things: fast enough, cheap enough, good enough. Not the smartest, but the quickest on the draw.

The most obvious use case is agent loops. Think about it — an agentic workflow runs 10 steps, and each step calls an LLM. With Haiku, that's 10 steps at roughly 20 seconds each: over three minutes of waiting. With Mercury 2? 17 seconds, done. That's not an incremental improvement — that's an order of magnitude.
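The compounding effect is simple arithmetic, using the end-to-end latencies from the benchmark table and assuming strictly sequential steps with no other overhead:

```python
# End-to-end latencies in seconds, from the benchmark table above.
LATENCY_SEC = {"Mercury 2": 1.7, "Claude 4.5 Haiku": 23.4}

def chain_latency(model, steps=10):
    """Total LLM wait time for `steps` sequential agent-loop calls."""
    return round(steps * LATENCY_SEC[model], 1)

print(chain_latency("Mercury 2"))        # 17.0 s
print(chain_latency("Claude 4.5 Haiku")) # 234.0 s, nearly four minutes
```

Real agent chains add tool-call time and retries on top, but the LLM wait scales linearly with step count either way, so the per-step gap is what you actually feel.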

Voice assistants follow the same logic. p95 latency determines whether your voice assistant sounds like a natural conversation or an international phone call with bad reception. Mercury 2’s 1.7-second response time basically keeps the conversation flowing at a human pace.

Coding workflows fit too — rapid prompt, review, tweak cycles. You don’t necessarily need frontier-level intelligence for code iteration, but if every prompt takes 20 seconds, your flow state is long gone. Plus, it’s an OpenAI-compatible API with 128K context window, tool use, and structured output support. Drop-in replacement — no architecture changes needed.
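Because the API is OpenAI-compatible, migrating is mostly a base-URL swap: with the official openai SDK you'd construct `OpenAI(base_url=..., api_key=...)` and keep the rest of your code. A minimal sketch of the request shape, where the endpoint URL and model id are my assumptions — check Inception Labs' docs for the real values:

```python
import json

BASE_URL = "https://api.inceptionlabs.ai/v1"  # assumed endpoint, verify in docs

def build_chat_request(prompt, model="mercury-2", api_key="YOUR_API_KEY"):
    """Assemble an OpenAI-style POST /chat/completions request.
    The model id "mercury-2" is an assumption for illustration."""
    return {
        "url": f"{BASE_URL}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = build_chat_request("Refactor this function to be pure.")
print(req["url"])
```

Any client that speaks the OpenAI chat-completions schema should work unchanged, which is the whole point of "drop-in replacement."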

Clawd Clawd twists the knife:

My honest take: Mercury 2 right now feels like the 2020 Tesla Model 3 — not the most luxurious, not the most horsepower, but it uses an entirely new powertrain to achieve “good enough for daily use + cheap enough to make you rethink why you’re still pumping gas.”

If you’re doing research tasks that need peak intelligence, stick with Opus or GPT-5.2. But if you’re running production agentic workloads, Mercury 2 is worth a serious look. ┐( ̄ヘ ̄)┌

Paradigm Shift or Just Hype?

Mercury 2 isn’t “yet another model.” It’s the first time a fundamentally different generation paradigm has produced meaningful results on reasoning tasks.

Car analogy: traditional LLMs are improving the combustion engine — bigger pistons, better turbochargers, fancier transmission. Mercury 2 is the electric car. The power source is different entirely. The “EV” hasn’t beaten the top “race cars” yet, but it’s already faster than most “daily drivers” — and way cheaper.

Here’s the bold part: Inception Labs doesn’t call themselves an “alternative” to the Transformer. They call themselves the successor. Their words: “Diffusion is the successor to the transformer, not an alternative.”

That sounds like bragging right now. But remember when Google dropped “Attention is All You Need” in 2017? RNN fans probably said the same thing.

History doesn’t repeat, but it rhymes. (◕‿◕)

Clawd Clawd butts in:

One last honest thought: I don’t know if diffusion LLMs will eventually replace Transformers. Nobody does. But what I do know is this — when a fundamentally different architecture produces “usable” reasoning results for the first time, at 5x the speed and 1/4 the price — that’s at least worth 10 minutes of serious thought.

The Transformer started the same way in 2017. First it was “interesting but not practical.” Then suddenly it was everywhere. ( ̄▽ ̄)⁠/


Sources: