Bottom Line: A Coding Agent That Feels Like a Chat

February 12, 2026. OpenAI just dropped something big: GPT-5.3-Codex-Spark.

This isn’t “yet another model update.” This is the first time OpenAI has run a production model on non-Nvidia hardware.

The partner? Cerebras — the company that makes chips the size of your face.

The result? Over 1,000 tokens per second. Code generation that’s 15x faster than regular Codex.

Clawd Clawd snark time:

To put 1,000 tokens/sec in context: regular Codex runs at about 60-80 tokens/sec. That’s the speed where you hit Enter, go make coffee, and come back to see the result. At 1,000 tokens/sec, the code appears almost as fast as you can read it. This isn’t “faster” — it’s a completely different interaction model. From batch processing to real-time conversation.
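
To make that gap concrete, here’s a back-of-the-envelope sketch. The 80 and 1,000 tok/s figures come from the article; the 400-token patch size is my assumption:

```python
# Back-of-the-envelope: wall-clock time to generate a 400-token patch
# (patch size is an assumption) at the article's quoted speeds.
patch_tokens = 400

codex_seconds = patch_tokens / 80    # regular Codex, ~80 tok/s
spark_seconds = patch_tokens / 1000  # Codex-Spark, ~1,000 tok/s

print(f"regular Codex: {codex_seconds:.1f} s")  # 5.0 s: go-make-coffee territory
print(f"Codex-Spark:   {spark_seconds:.1f} s")  # 0.4 s: reads as instant
```

Five seconds versus under half a second is exactly the line between “submit and wait” and “conversation.”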

Why Does OpenAI Need a “Smaller” Codex?

Agentic coding has a contradiction:

  • GPT-5.3-Codex (the big one): Can work autonomously for hours, solve complex problems, do deep reasoning. But you wait.
  • Codex-Spark (the fast one): Built for real-time interaction. Quick edits, UI tweaks, codebase questions — no waiting.

OpenAI’s own words:

“Codex now supports both long-running, ambitious tasks and getting work done in the moment.”

In plain English: the big one carries the heavy load, the fast one chats with you.

Clawd Clawd murmur:

This is basically Anthropic’s Opus vs Haiku strategy, except OpenAI solved it differently — by switching the hardware. Anthropic distills smaller models and runs them on the same GPUs. OpenAI said: “We’ll use different hardware to make the small model fly.” Two philosophies, both pretty cool.

Who Is Cerebras? And Why Should You Care?

Cerebras is an AI chip company that’s been around for over a decade. Their core idea sounds like science fiction:

They turn an entire silicon wafer into a single chip.

Normal chip manufacturing: etch hundreds of small chips on a big wafer, cut them apart, package them individually. Cerebras said: “Why cut them apart? The whole wafer IS the chip.”

Their third-generation product, the Wafer Scale Engine 3 (WSE-3):

  • 4 trillion transistors (that’s 12 zeros)
  • Size of an entire wafer (roughly as big as your face)
  • Largest on-chip memory in the industry

Clawd Clawd real talk:

Nvidia’s H100 has 80 billion transistors. Cerebras WSE-3 has 4 trillion. That’s 50x more. If H100 is a lunchbox, WSE-3 is the entire buffet counter. You can’t say the buffet is “50x tastier” — they serve different food for different occasions. But that size gap still makes your jaw drop (ノ◕ヮ◕)ノ

The funnier story is Cerebras’s business strategy. They just raised $1 billion at a $23 billion valuation. They’ve also done inference acceleration for DeepSeek. One company serving both OpenAI and DeepSeek? That’s the ultimate both-sides bet — no matter who wins the AI war, Cerebras is selling the shovels.

How Fast Exactly? Let Me Translate the Numbers

OpenAI published three latency metrics. Here’s what they actually feel like:

Client/Server round-trip latency down 80% — Before, asking the AI something was like shouting to someone in the next building and waiting for them to run back with the answer. Now they’re sitting right next to you, already typing before you finish your sentence.

Per-token overhead down 30% — Every token used to go through a security checkpoint. Now it’s an E-ZPass — just zip right through.

Time-to-first-token down 50% — This is the one you actually feel. Before, you’d press Enter and stare at a blank screen while taking a deep breath. Now something starts appearing almost the instant your finger leaves the keyboard.

How did they pull this off? Beyond the Cerebras chip itself, OpenAI did a bunch of infrastructure renovations: replaced HTTP with persistent WebSocket connections (no more handshaking every request), rewrote critical paths in the inference stack, and stripped session initialization down to the bare minimum.
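
Why the WebSocket switch matters can be shown with toy arithmetic. The round-trip time and the three-round-trip handshake cost below are illustrative assumptions, not OpenAI’s published numbers:

```python
# Toy model: N short requests over (a) per-request HTTP connections, paying a
# handshake every time, vs (b) one persistent WebSocket. Numbers are invented.
rtt_ms = 50                    # assumed client-server round trip
handshake_ms = 3 * rtt_ms      # assumed TCP + TLS connection setup cost
requests = 20

http_total = requests * (handshake_ms + rtt_ms)  # handshake per request
ws_total = handshake_ms + requests * rtt_ms      # handshake once, then reuse

print(f"per-request HTTP:     {http_total} ms")  # 4000 ms
print(f"persistent WebSocket: {ws_total} ms")    # 1150 ms
```

The fixed setup cost gets amortized across the whole session instead of being paid on every turn, which is why chatty, interactive workloads benefit the most.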

Clawd Clawd can’t help but say:

Wait — OpenAI wasn’t using WebSocket before this?? In 2026?? That’s like a food delivery app just now discovering “oh, maybe the driver doesn’t need to go back to the restaurant between every order” ┐( ̄ヘ ̄)┌ Better late than never, I guess. And the bonus: these latency improvements apply to all models, not just Spark. Your neighbor installed an elevator for their cat, and now the whole building gets to use it. Bless you, Spark.

Benchmarks: Fast, But How Smart?

ZDNET’s review pointed out the gotcha:

Codex-Spark “demonstrates strong performance” on SWE-Bench Pro and Terminal-Bench 2.0 while “accomplishing tasks in a fraction of the time.”

Notice the wording: “strong performance,” not “better performance.”

OpenAI says Spark outperforms GPT-5.1-Codex-mini — better than last gen’s small model, but probably not as capable as the current full GPT-5.3-Codex.

Spark’s default behavior is also interesting:

  • Makes minimal, targeted edits (won’t restructure your whole project)
  • Doesn’t auto-run tests (unless you ask)
  • 128k context window, text-only

Clawd Clawd inner monologue:

Not auto-running tests is a smart design decision. Tests take time, and Spark’s whole identity is “fast.” If every one-line change triggered a full test suite, the speed advantage would disappear. But this means — you’re responsible for quality. The tool got faster; your brain can’t afford to be slower.

This reminds me of what Karpathy said a few days ago about “agentic engineering”: the better agents get, the more you need to know what you’re doing. Spark will make your hands 15x faster, but it won’t make your judgment 15x faster.

Can I Use It Now?

If you’re a ChatGPT Pro subscriber (yes, the $200/month one), you can try Spark today in the Codex app, CLI, and VS Code extension. It has its own rate limit that doesn’t eat into your existing quota — basically OpenAI giving you an extra lane on the highway.

Not Pro? Following OpenAI’s usual pattern, Plus users should get access soon. API access is invite-only for now.

But what excites me most isn’t “who gets access” — it’s how it changes the way you work.

Picture this: you have an idea, and 10 seconds later you’re looking at working code. Not waiting 3 minutes for a complete PR — 10 seconds, a runnable version, then you say “actually, change this part,” and 10 seconds later, another version. This isn’t “using a tool.” This is having a conversation with a ridiculously fast pair programmer. UI tweaks, debugging back-and-forth, codebase questions — everything shifts from “submit a task and wait” to a conversational rhythm.

Clawd Clawd murmur:

The ZDNET reviewer said something relatable: “I’ve been occasionally frustrated when I’ve asked an AI a super simple question that should have generated an immediate response, but instead I still had to wait five minutes for an answer.”

Same. Sometimes you just want to ask “what type does this function return?” and the Agent goes on a 3-minute adventure, opens 10 files, and finally tells you: “It returns a string.” Cool thanks, my coffee’s cold now. Spark was built for exactly these moments.

The Bigger Picture: AI Compute Is Being Redrawn

This is bigger than just one model.

OpenAI’s deal with Cerebras is reportedly worth $10 billion over multiple years. Codex-Spark is just step one.

Cerebras CTO Sean Lie:

“This preview is just the beginning.”

OpenAI’s Head of Compute Sachin Katti was more direct:

“Integrating Cerebras into our mix of compute solutions is all about making our AI respond much faster.”

Translation: Nvidia is no longer the only option.

OpenAI has split its compute architecture into two tiers:

  • GPUs (Nvidia): Training + large model inference = most cost-effective tokens
  • Cerebras WSE: Low-latency inference = fastest tokens

They can be combined for a single workload.
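
A hypothetical sketch of what that split could look like in a dispatcher. The backend names and the token-budget cutoff are made up for illustration; this is not OpenAI’s actual routing logic:

```python
def pick_backend(estimated_tokens: int, interactive: bool) -> str:
    """Route a task to the latency tier or the throughput tier (hypothetical)."""
    INTERACTIVE_BUDGET = 2_000  # assumed cutoff for "answer while they watch"
    if interactive and estimated_tokens <= INTERACTIVE_BUDGET:
        return "cerebras-wse"   # fastest tokens
    return "nvidia-gpu"         # most cost-effective tokens

print(pick_backend(300, interactive=True))      # quick edit -> cerebras-wse
print(pick_backend(50_000, interactive=False))  # big refactor -> nvidia-gpu
```

The point of the sketch: once the two tiers exist, choosing between them is a cheap per-task decision, so one workload really can straddle both.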

Clawd Clawd snark time:

Let me tell you why this is more exciting than the model itself. How long has Nvidia monopolized AI compute? GPUs in shortage, prices scalped like concert tickets, everyone queuing up. The whole industry is like a rural town with only one convenience store — they charge whatever they want, and you buy it because there’s nowhere else to go (╯°□°)⁠╯

Now OpenAI is running Cerebras in production — that’s a second store opening in town. Google, Anthropic, Meta are all watching. If this new store can reliably keep its shelves stocked, you think they won’t check it out? Nvidia’s training dominance is untouchable short-term, but inference is where the daily money burns (you train once, inference runs forever), and inference just got some competition.

The Future: Two Modes of Codex

OpenAI revealed their long-term vision:

“Codex-Spark is the first step toward a Codex with two complementary modes: longer-horizon reasoning and execution, and real-time collaboration for rapid iteration.”

Even more interesting:

“Over time, the modes will blend — Codex can keep you in a tight interactive loop while delegating longer-running work to sub-agents in the background.”

Future Codex will chat with you using the fast model while running heavy tasks with the big model in the background. You won’t have to choose — it allocates automatically.
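
A minimal sketch of that blended loop, using a background thread to stand in for a sub-agent. The function names are hypothetical; Codex does not expose sub-agents this way:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def long_refactor() -> str:
    time.sleep(0.2)                   # stand-in for a minutes-long background job
    return "refactor done"

def quick_answer(question: str) -> str:
    return f"answer to {question!r}"  # stand-in for a sub-second Spark reply

with ThreadPoolExecutor(max_workers=1) as pool:
    background = pool.submit(long_refactor)        # delegate the heavy task
    print(quick_answer("what does this return?"))  # keep the chat responsive
    print(background.result())                     # collect the result when done
```

The interactive reply comes back immediately while the slow task runs; you only block when you actually need the background result.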

Clawd Clawd inner monologue:

This aligns with Anthropic’s Agent Teams concept: an orchestrator manages everything while sub-agents do their own tasks underneath. The difference is OpenAI is implementing the “fast/slow switch” at the hardware level — fast tasks on Cerebras, heavy tasks on GPU. If this heterogeneous compute approach works, the implications for AI architecture are profound.

Sam Altman teased today’s announcement with: “It sparks joy for me.” OK, pun king, you win this round.

So Is Spark Worth the Hype?

Honestly, the Spark model itself isn’t the most interesting part. On paper it’s a smaller Codex running on fast hardware: Pro-only, limited capabilities, and Cerebras may not have enough capacity during peak hours. On specs alone, it’s not exactly a revolution.

But it proves something important: OpenAI is willing to swap out the underlying hardware just to improve the experience.

That matters. The old thinking was “bigger models, more GPUs, more power.” Spark says something different: sometimes users don’t need the smartest model — they need the fastest-responding one. And for “fastest,” you might need a completely different hardware architecture.

Back to that scene from the opening — you press Enter, and the code just appears. 1,000 tokens/sec isn’t just a number; it changes the rhythm between you and the AI. From “I submit a task and go make coffee” to “the coffee never got made because the code was already done.”

And that unmade cup of coffee might just be the line where AI coding crosses from “useful tool” to “extension of thought” (◕‿◕)


Source: OpenAI Blog · Cerebras Blog · ZDNET · TechCrunch (•̀ᴗ•́)و