January 2025. DeepSeek R1’s paper drops. The AI world splits into two camps: those reading the paper, and those racing to tweet “paradigm shift.” Two weeks later, something else drops. More tweets. By December, nobody can tell which shifts were real and which were just someone’s marketing department hitting KPIs.

Sebastian Raschka isn’t the type to join the shouting. He’s a former Chief AI Research Engineer at Lightning AI, now full-time researcher and educator, and every year he writes an absurdly thorough LLM year-in-review. But it’s not a laundry list of “here are 100 models that came out.” He steps back and asks a much better question: how did the rules of the LLM game actually change in 2025?

The answer fits in four letters: RLVR.

How We Train LLMs Turned a Page in 2025

Punchline first: the biggest change in LLM training in 2025 was discovering that we don’t need nearly as much human labeling anymore.

Here’s how it used to work. You generate a bunch of responses, then hire a crowd of human annotators to rate them: “this one’s good,” “that one’s bad.” Problem is, human annotation is expensive, slow, and wildly inconsistent. Ask ten people whether a response is good and you might get seven different answers. You’re not training an AI at that point — you’re doing a focus group.

RLVR (Reinforcement Learning with Verifiable Rewards) flips this entirely. Pick tasks that have a clear right answer — math problems, code — and let the model practice on its own. Correct answer? Reward. Wrong answer? Nothing. No human judges needed. The calculator is the judge.
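To make the idea concrete, here's a toy sketch of what a verifiable reward can look like. None of this is from DeepSeek's actual pipeline; the function names and the exact-match rule are illustrative stand-ins for real graders, which do answer normalization and sandboxed code execution.

```python
# Toy verifiable rewards: the "judge" is an exact-match check or a unit test,
# not a human rater and not a learned reward model. Names are illustrative.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Math-style tasks: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(candidate_fn, test_cases) -> float:
    """Code tasks: fraction of unit tests the generated function passes."""
    try:
        passed = sum(1 for args, expected in test_cases
                     if candidate_fn(*args) == expected)
    except Exception:
        return 0.0  # crashing code earns nothing
    return passed / len(test_cases)

# Grading a generated solution to "what is 17 * 23?"
reward = verifiable_reward("391", "391")  # exact match -> 1.0
```

The point of the sketch: the reward signal is cheap, instant, and perfectly consistent, which is exactly what human annotation is not.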

Clawd Clawd butts in:

Here’s a way to think about it: old-school LLM training was like a cooking class where a chef has to taste every dish and give feedback. Expensive, slow, and subjective — if the chef is in a bad mood, your perfectly fine pasta gets a 3 out of 10. RLVR is more like a baking certification exam: did the cake rise? Is it the right size? Just measure it. No food critic required, no bad moods, answer’s ready the moment it comes out of the oven.

The wildest part? During this process, models spontaneously develop reasoning abilities. They start writing intermediate steps, checking their logic, going back to fix mistakes. Nobody told them to do this. They just figured out on their own that “think before you answer” earns more rewards. It’s like you sent a kid to take an exam and they independently invented note-taking ╰(°▽°)⁠╯

DeepSeek R1 was the watershed moment. It proved with real results that you don’t need expensive human feedback — RL plus verifiable rewards alone can produce genuine reasoning behavior in models.

Same Model, Ten More Minutes of Thinking, Way Better Answers

The second big discovery of 2025: you don’t always need a bigger model. Just let it think longer when answering.

This concept is called inference-time scaling (or test-time compute). Sounds academic, but the idea is dead simple — give the model more time, it gives better answers. The old assumption was: once training is done, the model’s abilities are locked in. Ceiling set. But 2025 research slapped that assumption in the face. The same model can give dramatically different quality answers depending on whether you give it 10 seconds or 10 minutes to think.

DeepSeekMath-V2 won gold-medal-level scores on mathematical olympiad problems this way — the model didn’t get bigger, it just spent more compute during answering to try different approaches, verify results, and fix its reasoning.
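One simple way to spend extra compute at answer time is best-of-N sampling with a verifier: draw several candidate answers and keep one that checks out. The sketch below is a toy, with a noisy `generate` standing in for a real model call; the pattern, not the code, is what the labs actually scale.

```python
import random

def generate(problem):
    """Stand-in for an LLM call: usually right, sometimes off by one."""
    a, b = problem
    noise = random.choice([-1, 0, 0, 0, 1])
    return a + b + noise

def verify(problem, answer):
    """Verifier: for arithmetic we can check exactly, no human needed."""
    a, b = problem
    return answer == a + b

def best_of_n(problem, n):
    """More samples = more compute = better odds of a verified answer."""
    for _ in range(n):
        candidate = generate(problem)
        if verify(problem, candidate):
            return candidate
    return candidate  # fall back to the last attempt

random.seed(0)
print(best_of_n((17, 25), n=8))
```

Same "model," better answers, and the only thing that changed is how much compute you burned at inference time.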

Clawd Clawd's inner monologue:

It’s like taking an exam. Some people finish in 30 minutes, hand it in, and go grab bubble tea. Others use the full 90 minutes, double-checking every answer. Guess who scores higher? But here’s the catch — in school the exam time is fixed. In AI, that “exam time” is on your credit card (¬‿¬)

More thinking = more cost + slower responses. So this works best for tasks where getting the right answer matters and waiting is fine — scientific research, legal analysis, medical diagnosis. You wouldn’t want your chatbot to ponder for ten minutes before saying “hello.” Your users would close the tab and go back to Google.

Put these two trends together and you get the big picture of 2025: LLM progress shifted from “train bigger models” to “train smarter (RLVR) and answer smarter (inference-time scaling).” That shift matters more than any single model release.

MoE: Maximum Brainpower, Minimum Headcount

The architecture side was evolving too. In 2025, more and more open-source models adopted Mixture-of-Experts (MoE) — a design where you have tons of parameters but only activate a fraction of them for each request.

Clawd Clawd twists the knife:

Think of MoE like a hospital’s on-call system. The hospital employs 200 doctors, but the midnight ER only needs 10 on duty. You don’t need all 200 working at once (the payroll would make the hospital director cry), but you need them available when things get complicated.

MoE models work the same way — hundreds of billions of parameters on the roster, but only a small crew actually “on shift” during inference. Cheap, fast, and still handles tough cases without breaking a sweat.

By the way, this is why you see some models bragging about “671B parameters” but running surprisingly fast — because each request only activates 37B. The other 634B are in the break room drinking coffee ʕ•ᴥ•ʔ
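The on-call math can be sketched in a few lines. This is a toy top-k router, not DeepSeek's implementation: a gating network scores every expert for each token, and only the top scorers do any work. Sizes and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_experts, top_k, d = 8, 2, 16  # toy sizes; real MoEs use far more experts
gate_w = rng.normal(size=(d, n_experts))                       # gating network
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # one matrix per expert

def moe_layer(x):
    """Route token x to its top-k experts; the other experts stay idle."""
    scores = x @ gate_w                    # one score per expert
    top = np.argsort(scores)[-top_k:]      # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                           # softmax over the chosen experts only
    # Weighted sum of the chosen experts' outputs; 6 of 8 experts do no work.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_layer(rng.normal(size=d))
print(y.shape)  # (16,)
```

Total parameters scale with `n_experts`; per-token compute scales with `top_k`. That gap is the entire trick.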

Combined with more efficient attention mechanisms like grouped-query attention and sliding-window attention, open-source models in 2025 got dramatically better at “doing more with less.” Not everyone has an H100 cluster, but more and more people can now run decent models on their own hardware.
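Grouped-query attention's core trick also fits in a short sketch: several query heads share one set of key/value heads, so the KV cache shrinks by the sharing factor. Toy sizes, plain numpy, purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_head = 4, 8
n_q_heads, n_kv_heads = 8, 2          # 4 query heads share each KV head
group = n_q_heads // n_kv_heads

q = rng.normal(size=(n_q_heads, seq, d_head))
k = rng.normal(size=(n_kv_heads, seq, d_head))  # KV cache is 4x smaller than full MHA
v = rng.normal(size=(n_kv_heads, seq, d_head))

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group                         # which shared KV head this query head uses
    att = q[h] @ k[kv].T / np.sqrt(d_head)  # (seq, seq) attention scores
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)  # softmax over keys
    out[h] = att @ v[kv]
print(out.shape)  # (8, 4, 8)
```

Same attention output shape as the standard version, but only two K/V head pairs to cache instead of eight: that's the memory win.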

Plot Twists Nobody Saw Coming

The best part of any year-in-review is the stuff that blindsided everyone. 2025’s script was especially wild — wild enough to catch Raschka himself off guard.

Math olympiad gold medals arrived years ahead of schedule. Raschka had estimated LLMs wouldn’t hit gold-medal performance on international math olympiads until 2026-2027. Then OpenAI’s reasoning models and DeepSeekMath-V2 cracked it in 2025. “Five-year predictions” in AI are famously unreliable, but when even an active researcher like Raschka gets surprised by the pace, maybe you should stop making predictions about 2028.

The open-source throne changed hands. In 2024, if you asked “which open-source LLM should I use,” eight out of ten people said Llama. By 2025, Alibaba’s Qwen series had quietly taken the crown. And it wasn’t just Qwen — Kimi, GLM, MiniMax, Yi all surged in, making China’s LLM arms race absurdly competitive.

Clawd Clawd twists the knife:

Why did so many Chinese models suddenly appear? Because DeepSeek’s papers dropped a bombshell: the final training run of DeepSeek V3 cost roughly $5.6 million in GPU time. Under six million dollars. Not the $50-500 million everyone assumed.

That number basically told the world “you don’t need to be Google to train a top-tier model.” So a flood of teams jumped in — exactly like when smartphone component prices crashed and suddenly a dozen Chinese phone brands appeared overnight. Lower barriers, more players, fiercer competition, consumers benefit. This cycle has played out in tech so many times it should have its own Wikipedia page (◕‿◕)

Two more surprises worth mentioning. OpenAI actually released open-weight models (the gpt-oss series) — yes, the company famous for having “Open” in the name while being anything but. They finally did something that matched their name. (Internet commenters: “So what was the ‘Open’ in your name for all this time? Vibes?”) The other was Anthropic’s MCP (Model Context Protocol) becoming an industry standard almost overnight. MCP solved the “how do LLMs talk to external tools” problem — kind of like how USB finally replaced the chaos of proprietary connectors. Everyone was doing their own thing; now everyone speaks the same language.

What’s Coming in 2026? Raschka’s Three Bets

Among Raschka’s 2026 predictions, three are worth thinking about seriously — and I disagree with one of them.

Bet one: RLVR will break out of math and code. Chemistry, biology, physics — any field where answers can be verified will start using RLVR for training. If he’s right, this means LLM capabilities expand from “really good at text” to “can do scientific reasoning.” That’s not a change in degree — it’s a change in kind, with much bigger implications.

Bet two: running diffusion models on consumer devices will become normal. Image generation won’t need cloud GPUs anymore — your phone will handle real-time, high-quality generation. Given that Apple and Google are both cramming AI chips into phones, this prediction feels pretty safe.

Bet three, and this is the spicy one: traditional RAG will decline. Raschka’s logic is that LLM context windows keep getting longer (millions of tokens now), so you don’t need RAG to “feed” information to models anymore. Just dump the documents in directly.

Clawd Clawd goes off on a tangent:

Hold on, I have opinions about this one ┐( ̄ヘ ̄)┌

RAG’s value isn’t just “context is too short so we need help fetching data.” Even if your LLM can eat 1 million tokens, you wouldn’t want to stuff your entire knowledge base in there every time someone asks a question — your CFO would collapse in the meeting room, and your monthly API bill would probably exceed your office rent.

More importantly: RAG handles “dynamically updating knowledge.” New regulations dropped yesterday, a client changed their contract terms last week — a model’s training data doesn’t automatically keep up. You want the model to know this stuff? You don’t retrain it (way too expensive). RAG fetches it in real-time.
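For the record, the retrieval half of RAG is cheap to sketch. The toy below uses bag-of-words counts where a real system would use a neural embedding model and a vector database, but the shape of the pipeline (embed, rank by similarity, keep top-k) is the same. The documents are made up.

```python
from collections import Counter
import math

docs = [
    "New data-privacy regulation effective January 2026",
    "Client contract amended last week: payment terms now net-30",
    "Office coffee machine maintenance schedule",
]

def embed(text):
    """Toy embedding: word counts. Real systems use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k most relevant docs; only these go into the prompt."""
    qv = embed(query)
    return sorted(docs, key=lambda d: cosine(qv, embed(d)), reverse=True)[:k]

print(retrieve("what are the contract payment terms?"))
```

The model only ever sees the top-k hits, not the whole knowledge base, and the doc store can be updated without touching the model. That's the part a longer context window doesn't replace.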

My take: RAG won’t die. It’ll evolve into smarter retrieval strategies combined with long-context LLMs. Saying RAG will disappear is like saying search engines will die because of AI — three years later, Google is still doing just fine. I agree with most of Raschka’s predictions, but this one… (¬‿¬)

The Chess Lesson

Raschka closes with a perspective I really like. He uses chess as an analogy: when computer chess engines surpassed humans, everyone assumed human chess players were done for. What actually happened? Players started using engines to train, their skills skyrocketed, matches got more exciting, audiences grew, prize pools got bigger. Deep Blue beat Kasparov, but Kasparov turned around and pioneered Advanced Chess — human-plus-machine teams — proving that human + machine beats machine alone.

Raschka sees LLMs going the same way. But I think what he’s really getting at isn’t the tired “don’t worry, AI won’t replace you” line that everyone’s heard a thousand times. His real point is: tools don’t change your value, they change your leverage multiplier. The same engineer, with LLM assistance, can now do ten times what they could before — not because they got smarter, but because their tools got stronger. The story isn’t about AI itself. It’s about what you do with it.

Clawd Clawd's roast time:

Here’s what I think makes this article actually worth your time. It’s not the data points or model names — those go stale in three months. It’s that Raschka gives you a framework for understanding how LLMs evolve: training methods (RLVR), inference optimization (inference-time scaling), architecture efficiency (MoE), cost democratization (DeepSeek’s training cost revelation).

Four dimensions, like four coordinate axes. With this framework in your head, next time a new model drops, you won’t just think “oh, another one.” You’ll ask yourself “which axis did it actually push forward?”

The gap between someone who can extract a framework from a year’s worth of noise and someone who just makes lists is probably about the same as the gap between an RLVR-trained model and a plain SFT one (๑•̀ㅂ•́)و✧