Imagine a classmate who always scores somewhere in the middle of the pack. One day they come up to you and say: “Hey, I improved by 6 points this time! And my lunch only costs half as much now!” You’d probably say: “Uh… congrats? But the top student scored 57 and you got 48.”

That’s basically where Grok 4.20 Beta is right now (◍˃̶ᗜ˂̶◍)ノ

xAI dropped three versions at once — reasoning, non-reasoning, and a multi-agent mode. Artificial Analysis ran a full evaluation right away, and the numbers tell an interesting story: real progress, some surprises, but also some “keep trying” moments.

The Report Card: Better, But the Teacher Is Not Impressed

Grok 4.20 Beta 0309 scored 48 on the Artificial Analysis Intelligence Index with reasoning enabled. That’s 6 points more than the previous Grok 4, and 9 points ahead of Grok 4.1 Fast.

Sounds good, right?

Here’s the thing — the class leaders are Gemini 3.1 Pro Preview and GPT-5.4, both sitting at 57. That’s a 9-point gap. This isn’t “almost caught up.” This is “you’re on the first floor and they’re on the third floor.”

Clawd Clawd’s two cents:

48 vs 57 — a 9-point gap might not sound like much, right? But on benchmarks like these, every point gets exponentially harder to climb. Going from 60 to 70 is like studying two extra days. Going from 90 to 91 might cost you your sanity. xAI climbing from 42 to 48 is real effort, but the folks ahead aren’t exactly standing still either ┐( ̄ヘ ̄)┌

Artificial Analysis did note that Grok 4.20 is strong at instruction following — when you tell it to do something, it actually does it. Combined with its low hallucination rate, those are the two cards it plays to differentiate itself from the frontier models.

The Tuition Discount: This Is Where It Gets Interesting

Okay, so it can’t win on brains. But xAI played a strong hand on pricing.

Grok 4.20 API pricing is $2/$6 (input/output per 1M tokens), down from Grok 4’s $3/$15. Output pricing alone dropped 60%. Artificial Analysis ran the full Intelligence Index evaluation and the reasoning version cost $484 — roughly 70% less than Grok 4. The savings come from two places: lower prices AND less token usage. Double discount.
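
To make the discount concrete, here’s a quick back-of-the-envelope comparison in Python. The per-million-token prices are the ones quoted above; the workload volumes are made up for illustration:

```python
# Per-1M-token API prices (input, output), in USD, as quoted above.
GROK_4 = (3.00, 15.00)
GROK_4_20 = (2.00, 6.00)

def workload_cost(prices, input_tokens, output_tokens):
    """Cost in USD for a token volume at (input, output) per-1M pricing."""
    in_price, out_price = prices
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Hypothetical workload: 10M input tokens, 5M output tokens.
old = workload_cost(GROK_4, 10e6, 5e6)     # $30 + $75 = $105
new = workload_cost(GROK_4_20, 10e6, 5e6)  # $20 + $30 = $50
print(f"Grok 4: ${old:.0f} / Grok 4.20: ${new:.0f} / saved {1 - new / old:.0%}")
# -> Grok 4: $105 / Grok 4.20: $50 / saved 52%
```

Notice the price cut alone only gets you about 52% on this mix; the roughly 70% savings Artificial Analysis measured also reflects the new model burning fewer tokens per task, which per-token math by itself can’t show.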

Context window also went from 256K straight to 2M tokens, matching Grok 4.1 Fast.

Clawd Clawd twists the knife:

So xAI’s strategy is: can’t beat them on test scores, so make tuition as cheap as possible? It’s like an all-you-can-eat buffet — the food isn’t the absolute best, but at this price you’re thinking “pretty good deal actually.” The API market is starting to look like a convenience store price war — when quality is close enough, the cheaper option wins ╰(°▽°)⁠╯

“I Don’t Know” Is an Answer: The Surprise Honor Student

Here’s where it gets genuinely interesting.

Grok 4.20 scored 78% on the AA-Omniscience non-hallucination metric — the best score Artificial Analysis has ever recorded across all models. What does that mean? When the model hits a question it doesn’t know the answer to, 78% of the time it’ll say “I don’t know” instead of making something up.

Think of it this way: imagine you ask your classmate an obscure history question. Most people would just make up an answer. But Grok 4.20 will honestly say “I’m not sure about this one” nearly four out of five times.
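
As a toy illustration of what that 78% measures (this is not the actual AA-Omniscience scoring, just the idea), with a sample I’ve invented to match the headline number:

```python
def non_hallucination_rate(abstained: list[bool]) -> float:
    """Of the questions the model couldn't answer, the share where it said
    'I don't know' instead of fabricating something. Toy metric only."""
    return sum(abstained) / len(abstained)

# Invented sample: 100 questions the model doesn't know the answer to,
# 78 honest abstentions, 22 made-up answers.
sample = [True] * 78 + [False] * 22
print(f"{non_hallucination_rate(sample):.0%}")  # -> 78%
```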

Clawd Clawd’s honest take:

Sounds great, right? “Doesn’t know? Says so!” — what wonderful honesty (◕‿◕) But flip it around: 22% of the time it still makes stuff up when it doesn’t know. And in real work scenarios, an assistant that keeps saying “I don’t know” will eventually make you want to flip the table. The trade-off between low hallucination and actually being useful? xAI didn’t mention that at all. It’s like an employee who never makes mistakes — because their method is to never do anything. Is that really what you want?

Speed and Tool Use: Half a Cheer

Inference speed is solid at 267 tokens per second, sitting right on the speed vs. intelligence Pareto frontier alongside gpt-oss-120b. No complaints there.
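
To put 267 tokens per second in wall-clock terms, here’s the rough conversion (it ignores time to first token, which this back-of-the-envelope deliberately skips):

```python
SPEED_TPS = 267  # output tokens per second, per Artificial Analysis

for output_tokens in (500, 2_000, 10_000):
    print(f"{output_tokens:>6} tokens -> ~{output_tokens / SPEED_TPS:.1f}s")
# ->    500 tokens -> ~1.9s
# ->   2000 tokens -> ~7.5s
# ->  10000 tokens -> ~37.5s
```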

But tool use is a mixed bag. On Tau2-Telecom it scored 97% — excellent. On GDPval-AA, a benchmark that tests real-world work tasks as a general agent, it only scored 1,062, clearly trailing frontier peers and roughly matching Grok 4.1 Fast.

Clawd Clawd’s roast time:

Tool use scores going all over the place is actually pretty normal. That’s benchmarks for you — pick the right test and you look like a genius, pick the wrong one and you’re just average. But GDPval-AA tests scenarios closer to “real work,” so if you’re thinking of using Grok 4.20 as an agent, maybe run it on your own use cases first before committing (⌐■_■)

The Secret Behind One API Call

One last interesting thing. Of the three versions xAI released, the multi-agent mode is the most unusual. Instead of making you build your own multi-agent framework, xAI splits your task across multiple agents on their backend — from your side, it’s just one API call. How many agents are actually running behind the scenes? You don’t need to know.
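
For what it’s worth, “one API call” presumably means the usual chat-completions request against xAI’s OpenAI-compatible endpoint. The sketch below assumes exactly that, and the model id `grok-4.20-multi-agent` is my invention, not a published name:

```python
import os

import requests

# Hypothetical model id; xAI hasn't published the real identifier here.
MODEL = "grok-4.20-multi-agent"

resp = requests.post(
    "https://api.x.ai/v1/chat/completions",  # xAI's OpenAI-compatible endpoint
    headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Plan a three-service migration."}],
    },
    timeout=600,
)
resp.raise_for_status()
answer = resp.json()
print(answer["choices"][0]["message"]["content"])

# Any server-side fan-out to multiple agents is invisible in the response
# shape; if this mode does burn extra tokens, answer["usage"] is where
# you'd catch it.
```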

Clawd Clawd’s friendly reminder:

It’s like ordering at a restaurant and assuming there’s one chef, then pushing open the kitchen door to find five people chopping, frying, and plating at once. But you’re paying for one chef… right? xAI hasn’t been clear on the pricing details for this mode. My guess is token usage will spike. Running that many agents behind the scenes can’t possibly be free (¬‿¬)


Back to that exam analogy from the beginning. Grok 4.20 is the classmate with mid-tier scores, but whose lunch is really cheap and who admits it when they don’t know an answer instead of bluffing. It won’t get you first place, but if what you need is “don’t make things up, don’t cost too much, and handle a big context window” — it’s actually a pretty solid pick.

xAI’s iteration pace from Grok 4 to 4.1 Fast to 4.20 has been brisk, with each generation improving on some front. But intelligence — 48 vs 57 — remains the most visible crack. The report card says “Most Improved,” not “Valedictorian.”