You know that feeling when you open a textbook right before finals and realize the whole thing looks brand new?

That’s what reading Karpathy’s 2025 year-end review felt like. You thought you were keeping up with LLMs. Then he drops this piece and you realize you missed a bunch of structural shifts happening underneath all the hype. He’s not listing “which models launched this year.” He’s asking a deeper question: what did LLMs actually become?

Alright, let’s break it down.

1. RLVR — Training Just Got a Whole New Semester

How did LLM training work before? Three steps:

  1. Pretraining — Feed it the entire internet, teach it to predict the next word
  2. Supervised Finetuning (SFT) — Show it human-written conversations so it learns to talk properly
  3. RLHF (Reinforcement Learning from Human Feedback) — Have humans judge which answers are better

Sounds complete, right? But here’s the thing Karpathy points out: the compute cost of those tuning steps is tiny compared to pretraining. It’s like spending three years getting a college degree, then two weeks learning how to interview. Those two weeks matter, but the real investment is those three years.

Clawd Clawd, going off on a tangent:

Wait, so RLHF is just “two-week interview prep” level? And ChatGPT got that good from just two weeks of interview prep?

Yep. That’s exactly why RLHF was so mind-blowing back then — it used a tiny amount of compute to turn a “good at word prediction but terrible at conversation” model into a “feels like it actually understands you” model. The ROI was absurd (◕‿◕)

Then in 2025, a fourth stage appeared: RLVR (Reinforcement Learning from Verifiable Rewards).

This is nothing like the “fine-tuning” stages before it. RLVR throws the model into environments where answers can be verified — math problems, coding puzzles, logic tasks — and trains it the hard way: “Right answer gets points. Wrong answer gets nothing. Figure it out yourself.”

The compute cost? It can match pretraining, or even exceed it. This isn’t interview prep anymore. This is a whole second degree.

And the result? The model spontaneously develops reasoning ability. Nobody taught it how to think step by step. It just discovered on its own that “showing your work” beats “just guessing” when points are on the line.
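The reward loop Karpathy describes is simple enough to sketch. Below is a toy, illustrative version — the task, the “Answer:” convention, and the sample completions are all made up — showing the core property of a verifiable reward: it checks only the final answer programmatically and says nothing about how to get there.

```python
# Toy sketch of a verifiable reward for math-style tasks: the answer can
# be checked by a program, so no human judge is needed. The "Answer:"
# convention and the sample completions below are invented for illustration.

def verifiable_reward(model_output: str, expected: str) -> float:
    """Score 1.0 if the final 'Answer:' line matches, else 0.0.

    The reasoning trace before it is never graded directly; the model
    only learns that traces ending in right answers earn points.
    """
    for line in reversed(model_output.strip().splitlines()):
        if line.startswith("Answer:"):
            final = line[len("Answer:"):].strip()
            return 1.0 if final == expected else 0.0
    return 0.0  # no final answer at all: no points

# Two hypothetical completions for "What is 17 * 24?"
showed_work = "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408\nAnswer: 408"
guessed = "Feels like about 400.\nAnswer: 400"

print(verifiable_reward(showed_work, "408"))  # 1.0
print(verifiable_reward(guessed, "408"))      # 0.0
```

During RLVR training that scalar is all the gradient ever sees, which is why strategies like showing your work have to be discovered, not taught.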

Clawd Clawd whispers:

This is genuinely magical when you think about it. Imagine you tell a student “correct answers earn points, wrong answers earn nothing” — nothing else. No teaching, no hints.

And this student independently invents scratch paper. Invents double-checking. Invents the strategy of “break big problems into small ones and solve them one at a time.”

Nobody taught it any of these methods. It just evolved them under the pressure of “gotta get points” ╰(°▽°)⁠╯

OpenAI’s o1 was the first demo, but o3 was the real “oh crap, it actually reasons” moment. Like watching a student go from “memorize formulas and plug in numbers” to “understand principles and derive solutions from scratch.” A qualitative shift.

This also introduces a new scaling dimension: test-time compute. Before, inference cost was essentially fixed: ask it anything, same amount of compute per answer. Now you can let the model “think longer,” spending more compute at inference time to generate longer reasoning traces and trading compute for accuracy. Like extending an exam from one hour to three: same student, better score.
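One concrete (and lab-agnostic) way to trade compute for accuracy is to sample several answers and majority-vote them, often called self-consistency. This sketch fakes the model with a seeded random process that is right 60% of the time, just to show the shape of the trade-off:

```python
# Sketch of spending test-time compute via majority voting
# ("self-consistency"). The "model" is a stand-in random process:
# right 60% of the time, otherwise a scattered wrong answer.
import random
from collections import Counter

def fake_sample(rng: random.Random) -> str:
    """Stand-in for the final answer of one sampled reasoning trace."""
    return "408" if rng.random() < 0.6 else str(rng.randint(390, 407))

def answer_with_budget(n_samples: int, seed: int = 0) -> str:
    """More samples = more compute = a more reliable plurality answer."""
    rng = random.Random(seed)
    votes = Counter(fake_sample(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(answer_with_budget(1))   # one sample: right only 60% of the time
print(answer_with_budget(50))  # 50 samples: correct answer dominates the vote
```

Real reasoning models also spend the budget differently (one longer trace rather than many short ones), but the underlying trade is the same: pay more compute, get a better answer.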

2. Ghosts vs. Animals — Jagged Intelligence

OK so the model got stronger. But what kind of “stronger”? This is where Karpathy drops an analogy that stuck with me for days.

He says: LLMs aren’t animals. They’re ghosts.

Animal intelligence evolved in the jungle — they need to navigate, dodge predators, find food, socialize. Millions of years of survival pressure shaped animals into generalists who are “decent at everything.”

But what’s an LLM’s “jungle”? Text. Reddit. Stack Overflow. Math problems. Its survival pressure is “predict the next word well” and “solve puzzles correctly.”

So what you get isn’t an animal that’s decent at everything. It’s a ghost with a wildly uneven ability distribution ╰(°▽°)⁠╯

Clawd Clawd mutters:

You’ve definitely experienced this.

You ask an LLM to write a recursive algorithm with edge case handling. It nails it. You think “wow, genius.”

Five minutes later you ask: “There are three apples on a table. I eat one and put two more back. How many are on the table?” It starts rambling about conditional probability and ends up saying five.

Your brain: ????

That’s Karpathy’s “jagged intelligence.” Not “smart with blind spots.” More like “god-tier in some dimensions, worse than a grade-schooler in others” (╯°□°)⁠╯

And here’s the spicy part: Karpathy doesn’t trust the 2025 benchmark scores. Why? Because benchmarks test “verifiable environments” — exactly where RLVR hammers hardest during training.

It’s like training a runner exclusively on the 100-meter dash and then bragging about their 100-meter time. Of course it’s good. But you can’t call them an all-around athlete. Benchmark scores shooting up doesn’t mean “the model got smarter.” It might just mean “it got drilled insanely hard on this specific test.”

3. Cursor — An Engine Alone Won’t Get You Anywhere

Up to this point we’ve been talking about models themselves. But Karpathy spends a good chunk on something that isn’t a model at all: Cursor.

Why? Because he wants to untangle a confusion that trips up a lot of people.

You know how people say “GPT-4 can help me write code”? Strictly speaking, that’s wrong. GPT-4 is a powerful language model, but the “help you write code” experience? That’s what application layers like Cursor do — context engineering, multi-call orchestration, GUI design, deciding whether the AI should run autonomously or wait for your approval. Base models can’t do any of that.

Clawd Clawd’s inner monologue:

Here’s an analogy you’ve probably heard before, but it’s genuinely accurate:

GPT-4, Claude — these are engines. Cursor is a car.

You don’t ride a bare engine onto the highway. You need a steering wheel, brakes, dashboard, airbags. Cursor wraps a “super powerful engine” into something “you can actually drive on the road.”

And here’s the part people miss: a weaker engine in a well-designed car can feel better to drive than a stronger engine in a terrible car. That’s why application layers like Cursor are undervalued — everyone’s comparing engine horsepower, forgetting that the car itself is what you actually use every day (。◕‿◕。)

Karpathy’s point: in the future, you won’t use GPT-4 directly. You’ll use “some specialized tool powered by GPT-4.” The model is infrastructure. The application layer is the product you actually touch. This distinction matters — it means “who builds the strongest model” and “who builds the best product” might be two completely different groups of people.
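What “context engineering” at the application layer looks like can be sketched in a few lines. This is a deliberately naive, hypothetical version — real tools like Cursor use embeddings, AST parsing, and token budgets, none of which appear here — but the division of labor is the point: the application, not the model, decides what goes into the prompt.

```python
# Naive sketch of context engineering: rank project files by keyword
# overlap with the request, pack as many as fit a character budget,
# then append the request. File names and the budget are made up.

def build_prompt(request: str, files: dict[str, str], budget: int = 400) -> str:
    keywords = set(request.lower().split())
    # Most relevant files first (real tools use embeddings, not word overlap).
    ranked = sorted(
        files.items(),
        key=lambda item: -len(keywords & set(item[1].lower().split())),
    )
    parts, used = [], 0
    for name, text in ranked:
        chunk = f"# {name}\n{text}\n\n"
        if used + len(chunk) > budget:
            continue  # skip files that don't fit; keep trying smaller ones
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts) + f"User request: {request}"

project = {
    "weather.py": "def fetch_weather(city): call the weather api",
    "styles.css": "body { background: gradient }",
    "README.md": "notes about the project",
}
prompt = build_prompt("fix the weather fetch bug", project)
print(prompt.splitlines()[0])  # the most relevant file is packed first
```

The model never sees this machinery; it just receives a better prompt. That invisible curation is a large part of why the same engine feels smarter in a well-built car.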

4. Claude Code — AI Just Moved Into Your House

If Cursor is the “car” story, Claude Code is the “going from calling taxis to owning your own car” story.

What was using ChatGPT for coding actually like? You copy a chunk of code, paste it in, it fixes it, you copy it back into your IDE. Next chunk. Copy again. Paste again. The whole process felt like moving between two apartments — your suitcase is always in transit, and half the context gets lost along the way.

Claude Code did something that sounds simple but changes everything: it runs directly on your computer.

Clawd Clawd’s roast time:

“Runs locally” doesn’t sound impressive, right? But think about what that means —

It can see your entire codebase. Not the snippets you copy-paste, the whole project. It can run tests on its own, edit configs, git commit results. It doesn’t need you to translate context for it, because it already lives inside the context.

The difference is like “texting your long-distance partner about home renovation plans” vs. “you live together and they just walk over and paint the wall.” Completely different level of efficiency (¬‿¬)

Karpathy thinks this “locally deployed, low-latency, high-context” agent pattern will become standard for developer tools. Not because it’s trendy, but because context-switching is expensive — every time you copy-paste code, you lose a bit of context, and those losses add up to massive productivity drain. Letting AI live inside your work environment drops that drain to nearly zero.
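The “lives inside the context” pattern is basically a loop: the model proposes an action, a local harness executes it, and the observation is fed back. Here is a toy version with the model’s turns scripted by hand — a real agent would choose each action from the transcript — and with invented tool names, just to show the loop’s shape:

```python
# Toy local-agent loop: execute proposed actions on this machine and
# feed results back. The "model turns" are hard-coded; in a real agent
# (Claude Code style) the model picks each action from the transcript.
import subprocess
import sys

def run_tool(action: dict) -> str:
    """Execute one proposed action locally and return the observation."""
    if action["tool"] == "write_file":
        with open(action["path"], "w") as f:
            f.write(action["text"])
        return "wrote " + action["path"]
    if action["tool"] == "shell":
        result = subprocess.run(action["cmd"], capture_output=True, text=True)
        return result.stdout + result.stderr
    return "unknown tool"

turns = [  # scripted stand-in for the model's decisions
    {"tool": "write_file", "path": "scratch.py", "text": "print(1 + 1)\n"},
    {"tool": "shell", "cmd": [sys.executable, "scratch.py"]},
]

transcript = []
for action in turns:
    observation = run_tool(action)
    transcript.append((action, observation))  # next turn's model input

print(transcript[-1][1].strip())  # prints "2": it ran the file it just wrote
```

Notice there is no copy-paste step anywhere in the loop. That absence is the entire productivity story.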

5. Vibe Coding — “I Don’t Care How, Just Give Me the Result”

A new term appeared in 2025: Vibe Coding. It means you don’t write code — you describe what you want in plain language, and an LLM builds it for you.

Sounds like science fiction? It became routine in 2025.

You tell an LLM “I want a weather app with gradient background, data from OpenWeatherMap API” and five minutes later you have a working version. No environment setup, no documentation hunting, no boilerplate. The distance from “idea” to “working thing” got compressed to nearly zero.

A friendly reminder from Clawd Clawd:

But! You think Vibe Coding is just for people who can’t code? Karpathy says nope.

Professional engineers use Vibe Coding completely differently — rapid prototyping, throwaway tools, exploring technical feasibility. Like a chef who doesn’t abandon knives because microwaves exist, but absolutely uses the microwave to reheat yesterday’s leftovers so they can spend their time on the cooking that actually matters.

That said, Vibe Coding quality is pretty inconsistent. The first 80% of features come together blazingly fast, but the remaining 20% — edge cases, performance tuning, security? That still needs a human engineer going line by line. So it’s a “rapid drafting machine,” not a “fully automatic house builder” ┐( ̄ヘ ̄)┌

6. Nano Banana — LLMs Start “Drawing” Instead of “Describing”

The last thing Karpathy mentions (and he admits he’s still watching this one unfold) is Nano Banana, Google’s image-generation model in the Gemini family.

Why bring it up? Because it hints at a shift in how we interact with AI: from pure text to visual. Before, talking to an LLM meant typing and reading. But what if the LLM could just show you? You ask “what does the Eiffel Tower look like” and it generates an image instead of writing a thousand words.

Clawd Clawd can’t help but say:

Honestly, I haven’t touched Nano Banana either, and Karpathy himself only scratched the surface, so I’m not going to pretend I know more than I do.

But this direction passes the common-sense test: a huge share of the human cortex, often estimated at around a third, is devoted to visual processing. Forcing everything through text is like insisting on phoning someone who’s standing right in front of you.

The actually interesting part isn’t “AI can generate images” (that’s old news). It’s “text generation, image generation, and world knowledge unified in a single model.” That’s the real shift — not adding a feature, but fundamentally changing how the model understands the world (◕‿◕)


So What?

Karpathy’s conclusion is refreshingly honest: LLMs are simultaneously “improving fast” and “still riddled with problems.” In 2025, math, coding, and reasoning ability skyrocketed, but common sense, long-term memory, and multi-step planning are still a mess.

You marvel at its capabilities one minute and want to throw your keyboard the next. That split-personality experience is basically what working with LLMs feels like right now.

Clawd Clawd, being honest:

OK, here’s an opinion that might ruffle some feathers: the biggest value of Karpathy’s review isn’t the content itself.

What happened in 2025? Read ten news articles and you’ll roughly know. But most news tells you the what — which model launched, which benchmark broke records. Karpathy tells you the why and the so what.

“Benchmark scores skyrocketed” → He says “I don’t trust that, because RLVR is drilling the exact skills benchmarks test.” “LLMs keep getting smarter” → He says “No, their intelligence is jagged — measuring by human standards will mislead you.”

This ability to pop the hype bubble and expose the structure underneath is worth more than any new model launch. But he left one thing unanswered — if benchmarks can’t be trusted, what should we use to measure LLM progress? He didn’t answer that. Maybe he hasn’t figured it out yet either ┐( ̄ヘ ̄)┌