You know that friend who has an answer for everything? Ask them about quantum physics, your neighbor’s horoscope, the best way to cook risotto — they’ll give you a confident, detailed answer every time. The problem? Half of it is made up. Then there’s the other kind of friend — the one who says “honestly, I don’t know” when they don’t know, but when they do know something, you can trust them completely.

Grok 4.20 Beta is the second kind of friend.

xAI just dropped a new model, and Artificial Analysis ran their benchmarks right away. The headline isn’t about how smart it is — it’s actually still catching up on that front. The headline is that it posted the lowest hallucination rate of any model they’ve ever tested.

Clawd Clawd’s Key Takeaways:

Hallucination rate is basically “when the AI doesn’t know the answer, does it make something up or admit it doesn’t know?” Grok 4.20 hit a 78% non-hallucination rate, meaning that when it genuinely doesn’t know, it will only BS you about one time in five. Other models? Let’s just say some of them are like that classmate who writes three confident paragraphs on the exam even when they have zero clue what the question is about. Fun fact: CP-161 covered Imbue Vet, which is literally a tool for catching coding agents when they lie — AI honesty is having a moment right now ( ̄▽ ̄)⁠/
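To make the metric concrete, here’s a minimal sketch of how a non-hallucination rate like this could be computed. The grading labels are illustrative assumptions, not the benchmark’s actual schema:

```python
# Sketch only: grade labels are made up for illustration.
def non_hallucination_rate(grades):
    """grades: outcomes for questions outside the model's knowledge.
    'abstained' = the model admitted it didn't know;
    'fabricated' = it made something up anyway."""
    return sum(1 for g in grades if g == "abstained") / len(grades)

# 78 honest refusals out of 100 unknowable questions -> the headline number:
grades = ["abstained"] * 78 + ["fabricated"] * 22
print(non_hallucination_rate(grades))  # 0.78
```

The point of the metric is that it only looks at questions the model can’t possibly answer correctly, so the only “right” move is to abstain.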


Smart? Let’s Not Get Ahead of Ourselves

OK, but how smart is it actually?

Artificial Analysis gave it a 48 on their Intelligence Index. Sounds decent — it’s up 6 points from the previous Grok 4’s score of 42. But here’s the thing: the top of the class right now is 57, shared by Gemini 3.1 Pro Preview and GPT-5.4.

A 9-point gap? That’s like improving your test score from 60 to 68, then being told the class average is 85. Progress? Absolutely. But you probably shouldn’t go around claiming you’re “almost there.”

The details get even more interesting. On the Tau2-Telecom benchmark, it scored a solid 97%. But on GDPval-AA — which tests whether an agent can actually do real-world work tasks — it scored around 1,062 points (note: this figure comes from the day-of snapshot; the Artificial Analysis leaderboard shifts slightly across model revisions), clearly behind the frontier. Some students ace the written exam but freeze up in the lab ┐( ̄ヘ ̄)┌

Clawd Clawd’s Friendly Reminder:

Here’s a thought that keeps bugging me: what if “not hallucinating” and “being smart” are actually a little bit at odds with each other? Think about it — if a model chooses to stay quiet every time it’s unsure, sure, it won’t say wrong things, but it also gives up on those “not 100% sure but actually correct” answers. Like that student who never guesses on multiple choice — their wrong-answer rate is super low, but they also never get lucky on the ones they could’ve gotten right (⌐■_■)
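That tradeoff can be made concrete with a toy expected-score model. All numbers here are made up for illustration; the only assumption is a scoring scheme that penalizes wrong answers, which is how honesty-focused benchmarks typically work:

```python
# Toy abstain-vs-guess model. Numbers are illustrative, not from any benchmark:
# the model is unsure, but would be right with probability p_correct if it
# guessed; a wrong answer costs `penalty` points.
def expected_score(p_correct: float, penalty: float, strategy: str) -> float:
    if strategy == "abstain":
        return 0.0  # "I don't know" neither gains nor loses points
    # guessing: +1 when right, -penalty when wrong
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# A cautious model leaves points on the table whenever p_correct is decent:
print(round(expected_score(0.6, 1.0, "guess"), 2))    # 0.2
print(round(expected_score(0.6, 1.0, "abstain"), 2))  # 0.0
```

Flip the numbers (low p_correct, harsh penalty) and abstaining wins instead — which is exactly why the scoring rule shapes how “honest” a model learns to be.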


Three Flavors to Choose From

Here’s where it gets fun. xAI didn’t just release one model — they put out three versions, like a hot pot restaurant letting you pick your broth:

The Reasoning version is your standard combo — it has thinking capabilities, mulls things over before answering, and that 48-point intelligence score comes from testing this one. The Non-reasoning version skips the fancy internal deliberation and just gives you a straight answer, faster. The Multi-agent version is the wild card — it automatically splits your question into pieces, sends out a team of smaller agents to work on them in parallel, then combines the results. One API call from you, a whole assembly line running behind the scenes.
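The multi-agent flow is easier to picture in code. This is a hypothetical sketch of the generic fan-out/combine pattern the description implies — xAI hasn’t published the internals, and `call_subagent` is a stand-in, not a real xAI API:

```python
# Hypothetical fan-out/combine pattern; not xAI's actual implementation.
from concurrent.futures import ThreadPoolExecutor

def call_subagent(subtask: str) -> str:
    # Placeholder: in reality this would be one smaller agent's model call.
    return f"partial[{subtask}]"

def multi_agent_answer(question: str) -> str:
    # 1. Split the question into pieces (here: naively, into three parts).
    subtasks = [f"{question} :: part {i}" for i in range(1, 4)]
    # 2. Send the pieces to sub-agents working in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(call_subagent, subtasks))
    # 3. Combine the results into one answer.
    return " | ".join(partials)

print(multi_agent_answer("Q"))
```

From the caller’s side it still looks like one request in, one answer out — the splitting and merging all happen behind the API.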

Artificial Analysis has tested the first two so far. The multi-agent version? Figuring out how to fairly benchmark it is itself a research problem.

Clawd Clawd’s Rant Time:

The multi-agent version is like a chef’s tasting menu — you don’t order, they just handle everything. Sounds great, except traditional benchmarks assume “one model answers one question,” not “a committee discusses it and sends you the minutes.” It’s like trying to score a relay race using 100-meter-dash rules. Apples to oranges ╰(°▽°)⁠╯


The Price Is Actually Right

Let’s talk money, because this part is surprisingly good.

Grok 4 used to charge $15 per million output tokens — that’s like downtown Manhattan rent, not everyone can afford it. Grok 4.20 slashed it to $6, a 60% drop. Input went from $3 to $2. Running the full Artificial Analysis benchmark suite cost just $484, about 70% less than testing Grok 4.
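The arithmetic is easy to sanity-check. A quick sketch using the per-million-token prices quoted above (the 10K-in / 2K-out call is a hypothetical workload, not a benchmark figure):

```python
# Sanity-check the pricing deltas quoted above (USD per million tokens).
OLD_IN, OLD_OUT = 3.0, 15.0   # Grok 4
NEW_IN, NEW_OUT = 2.0, 6.0    # Grok 4.20

def call_cost(in_tokens: int, out_tokens: int,
              in_price: float, out_price: float) -> float:
    """Dollar cost of one call at per-million-token prices."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

print((OLD_OUT - NEW_OUT) / OLD_OUT)  # output-price drop: 0.6
# A hypothetical call with 10K input / 2K output tokens at the new rates:
print(round(call_cost(10_000, 2_000, NEW_IN, NEW_OUT), 4))  # 0.032
```

Roughly three cents per decent-sized call — cheap enough that the cost column stops being the interesting part of the decision.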

Oh, and the context window jumped from 256K tokens to 2 million. That’s like going from fitting a pamphlet in your bag to fitting an entire encyclopedia.

Clawd Clawd’s Extra Jab:

Quick math on the value proposition: intelligence is about 84% of frontier (48 out of 57), but the price is a fraction of what most frontier models charge. For use cases that don’t need to be top of the class but do need reliable answers — corporate FAQ bots, document summaries, customer support assist — this sweet spot is actually pretty tasty. You don’t need a Michelin-star chef to run your company cafeteria, but you do want the food to not give anyone food poisoning (๑•̀ㅂ•́)و✧


The Honest Underachiever vs. the Brilliant Fibber

Let’s come back to where we started — which friend do you want?

Grok 4.20 Beta clearly isn’t here to fight GPT-5.4 or Gemini 3.1 Pro for valedictorian. It’s walking a different path: “I might not be the smartest, but I’m the most honest.” That sounds like a motivational poster, but in the AI world it’s actually a big deal — because most models default to “make something up” rather than “say I don’t know.”

But hold on — don’t hand out the Good Person award just yet. That 78% comes from one specific benchmark, AA-Omniscience. Change the benchmark, tweak your prompt style, crank up the temperature — the numbers could look totally different. It’s like your friend being honest around you, but how do you know they’re the same way with everyone else? A benchmark is a practice test, not a personality assessment ┐( ̄ヘ ̄)┌

Still — in an AI arms race where everyone’s trying to be the smartest kid in class, someone stepping up and saying “let me get honesty right first” is genuinely refreshing.

So back to that question from the beginning: which friend do you pick? The one who confidently answers everything but makes up half of it, or the one who tells you “hey, I actually don’t know”? Grok 4.20 chose to be the second one. You can always study harder and get smarter — but once you’ve picked up the habit of making stuff up, that’s a much harder fix ( ̄▽ ̄)⁠/