43 Million Tokens → 5 Million Tokens. Eight Months Apart.

In April 2025, you hand a set of university-level math problems to o4-mini. You crank it up to high reasoning effort — let it think as hard as it can. It chews through 43 million output tokens and manages about 27% accuracy.

In December 2025, GPT-5.2 takes the same exam. Reasoning effort set to low — basically “I’m not even trying” mode. Result? 5 million tokens. Same 27%.

Eight months. Same test. Nearly 90% fewer tokens. And the newer model wasn’t even breaking a sweat.
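For the record, the "nearly 90%" is straightforward arithmetic on the two token counts above:

```python
# Token counts from the FrontierMath runs cited above
o4_mini_tokens = 43_000_000   # o4-mini, high reasoning effort (April 2025)
gpt_52_tokens = 5_000_000     # GPT-5.2, low reasoning effort (December 2025)

reduction = 1 - gpt_52_tokens / o4_mini_tokens
print(f"{reduction:.0%} fewer output tokens for the same score")  # 88% fewer...
```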

Clawd Clawd, twisting the knife:


You know what this looks like? Imagine you’re in a math exam, scribbling furiously across three pages of scratch paper, erasing and rewriting, barely finishing before time runs out. Then you glance over and see the kid next to you wrote half a page, stood up, and left early — probably went to grab bubble tea. You both got the same score. That’s the vibe here (╯°□°)⁠╯

These numbers come from FrontierMath, a brutally hard math benchmark maintained by Epoch AI. And Epoch AI senior researcher Jean-Stanislas Denain used this data to tackle a question that’s been burning through the AI safety community for months:

Will AI inference always be this expensive?

The Pessimist’s Case: Pay to Think, Every Single Time

Philosopher and AI safety researcher Toby Ord wrote a substantial analysis with a straightforward argument:

RL (Reinforcement Learning) training makes models better mainly by making them “think longer” — longer Chain of Thought, more tool calls, more steps. But here’s the catch: training costs are paid once and shared across all users. Inference costs are per-use. Every user, every request, pays independently.

So Toby’s conclusion feels intuitive: harder problems → more thinking → higher inference costs → this is a persistent economic burden that won’t go away on its own.

Clawd Clawd's inner monologue:

Toby Ord is the Oxford philosopher who wrote The Precipice, a book about existential risks to humanity. Seeing doom everywhere is literally his job description. But credit where it’s due — his reasoning framework is solid. Epoch AI’s rebuttal isn’t about his logic, it’s about his assumptions. Perfect logic with wrong assumptions still gives you wrong answers. Like mathematically proving “I’ll never save any money” while forgetting you’re getting a raise next month ┐( ̄ヘ ̄)┌

Epoch AI Fires Back: You’re Underestimating How Fast Costs Fall

Jean-Stanislas Denain basically agrees with Toby’s framework — yes, RL does eat more inference. But he argues Toby drastically underestimates how fast costs drop.

The FrontierMath numbers from the opening were just the appetizer. The full trend is even more striking: inference cost to reach any given capability level falls roughly 5 to 10x per year.

What does that look like in practice? Say a task costs fifty thousand dollars in inference today —

One year later, same performance, five thousand. Another year, five hundred.
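That two-step decline is just compound decay. A quick sketch, assuming the 10x end of Epoch AI's 5-10x range:

```python
def cost_after(initial_cost: float, years: int, annual_drop: float = 10.0) -> float:
    """Inference cost to reach the same capability level after `years` of decline."""
    return initial_cost / (annual_drop ** years)

for y in range(3):
    print(f"year {y}: ${cost_after(50_000, y):,.0f}")
# year 0: $50,000
# year 1: $5,000
# year 2: $500
```

At 5x per year instead, the same $50,000 task would still be down to $2,000 after two years. Either way, the curve is the point, not the exact rate.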

Clawd Clawd, twisting the knife:

Fifty thousand → five thousand → five hundred. This isn’t a “price reduction.” This is “going-out-of-business sale, then demolishing the building and rebuilding from scratch.” If your API bill gives you heart palpitations right now, remember — you’re paying the early adopter tax. But don’t celebrate too soon. Once costs drop, you’ll definitely want the fancier model, and the bill climbs right back up. The AI treadmill effect: you keep running, but the scenery never changes ( ̄▽ ̄)⁠/

Three Engines Making Costs Disappear

Why do inference costs fall so fast? It’s not magic. Three things are happening at once.

Engine One: Distillation — Let the A-Student Teach the C-Student

A massive model spends astronomical training costs learning a capability. Then you feed its “problem-solving approach” to a much smaller model. The smaller model doesn’t need to learn from scratch — it just needs to imitate. Way cheaper to run.

This is exactly why GPT-5.2 on low effort matched o4-mini on high effort. It’s not genius — it’s that previous generations’ reasoning ability has been “compressed” into its base model.
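Mechanically, distillation usually means training the small model to match the big model's next-token probability distribution, not just its final answers. A minimal, framework-free sketch of the loss (the logits below are made-up numbers, not from any real model):

```python
import math

def softmax(logits, temperature=2.0):
    # A higher temperature softens the distribution, exposing the teacher's
    # "dark knowledge" about which wrong answers were almost right.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from teacher to student over one next-token distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The loss is zero when the student already imitates the teacher perfectly,
# and grows as the student's distribution drifts away from the teacher's.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
```

The small model never has to rediscover the solution path; it only has to reproduce the teacher's probabilities, which is a far easier optimization problem.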

Clawd Clawd, being serious for a moment:

Distillation is basically past exam papers, but at an absurd scale. Picture this: the first person to crack a proof is a Terence Tao-level genius who spent three days forging an entirely new solution path. Then the TA cleans up that path into a neat “how to solve it” guide and posts it on the course website. Next year’s students read it in twenty minutes and score 80% on the same exam. They’re not geniuses — they’re just standing on shoulders that a genius organized for them. AI distillation is the automation of shoulder-organizing. And hilariously, it might be the first time in history where copying homework requires more engineering skill than doing the homework (◕‿◕)

Engine Two: Inference Algorithms Keep Getting Smarter

Engineers keep finding ways to squeeze more juice from the same GPU. And it’s not one breakthrough — it’s several lines of attack happening at once.

Start with Speculative Decoding. The idea is simple: a fast, tiny model races ahead and guesses the next tokens, then the big model checks its work. If the guess is right, you keep it. If not, you redo just that part. Think of it like a waiter at your favorite restaurant — “the usual?” If they guess right, you save a whole round trip. Token generation speed doubles, and the answer quality stays identical.
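The control flow is easy to sketch. Both "models" below are toy stand-in functions (the draft deliberately flips its third guess), but the accept-or-redo logic is the real shape of the algorithm:

```python
def draft_model(prefix, k):
    # Fast but imperfect: guesses the next k tokens, with a deliberate
    # toy error planted at position 3.
    guesses = [(prefix + i) % 10 for i in range(1, k + 1)]
    if k >= 3:
        guesses[2] = (guesses[2] + 5) % 10
    return guesses

def target_model(prefix, k):
    # What the big model would emit; one forward pass verifies all k drafts.
    return [(prefix + i) % 10 for i in range(1, k + 1)]

def speculative_step(prefix, k=4):
    draft = draft_model(prefix, k)
    verified = target_model(prefix, k)
    out = []
    for d, v in zip(draft, verified):
        out.append(v)        # the verified token is always correct to keep
        if d != v:           # first mismatch: stop, resume drafting from here
            break
    return out

print(speculative_step(0))  # [1, 2, 3]
```

The win: one expensive big-model pass validated three tokens instead of producing one, and because every kept token came from the verifier, the output is exactly what the big model would have written on its own.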

Then there’s memory. During inference, models maintain something called a KV Cache — basically their short-term memory. This used to eat GPU memory like a black hole. Now, Paged Attention and Sparse Attention compress that usage to a fraction. Rarely-used memories? KV Cache Offloading ships them off to cheaper storage to hibernate. The net effect: the same card can serve dramatically more users at once.
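To see why the KV cache is worth all this engineering, here's a back-of-envelope size estimate. Every layer/head number below is an illustrative assumption, not any particular model's config:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Memory for the K and V tensors a decoder caches per request (fp16)."""
    # 2 = one K tensor plus one V tensor per layer, per cached token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gb = kv_cache_bytes(seq_len=128_000) / 1e9
print(f"~{gb:.0f} GB for a single 128k-token conversation")  # ~42 GB
```

Tens of gigabytes for one long conversation, on cards that top out around 80 GB. That's why paging, sparsity, and offloading translate directly into "more users per GPU."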

The last line of attack is the most interesting: teaching models to think less. Anthropic cut a huge chunk of Chain of Thought verbosity between Sonnet 3.7 and Sonnet 4. Not by making the model dumber — by teaching it to reach the point faster without deriving everything from first principles every single time.

Clawd Clawd, whispering:

Sonnet 3.7’s reasoning sometimes genuinely read like a master’s thesis — ask it what 2+2 is and it would start by constructing the formal definition of natural numbers from set theory axioms, then derive the Peano postulates, and only then announce “therefore 2+2=4.” Dude, I just want to know how much to split dinner. Sonnet 4 is much better — it finally learned the basic social skill of “stop rambling and just give the answer,” something most humans spend years of socialization acquiring. But honestly, turning “think less” into a performance improvement is kind of philosophical ┐( ̄ヘ ̄)┌

Engine Three: Hardware Gets Cheaper Every Generation

Every new GPU generation delivers more FLOPS per dollar. This is the most boring but most reliable cost reduction engine — Moore’s Law’s old friend, showing up on schedule every year, never disappointing.

What About Toby’s RL Efficiency Numbers?

Toby’s other big claim: RL scaling has terrible returns — roughly 10,000x more RL compute to match what 100x more inference gives you.
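In log terms, that claimed exchange rate is brutal:

```python
import math

# Toby's claim: 10,000x more RL training compute buys the same capability
# gain as 100x more inference compute.
rl_orders = math.log10(10_000)        # 4 orders of magnitude
inference_orders = math.log10(100)    # 2 orders of magnitude
print(rl_orders / inference_orders)   # 2.0
```

Two orders of magnitude of training spend for every one order of inference-equivalent gain, if the estimate holds. Which is exactly the "if" Epoch AI attacks.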

But Epoch AI thinks that estimate doesn’t hold up, for three reasons:

First, the data is too thin. Toby’s estimates come mainly from OpenAI’s published o1 scaling charts — but those charts had the x-axis numbers removed. You’re basically reading a map with no scale and saying “I think the distance is about this much.”

Second, algorithms keep improving. Academic research shows newer RL methods (like Scaled RL) can be 2x+ more efficient than GRPO. Using old method efficiency to predict the future is like using 2010 smartphone battery life to predict 2025.

Third, OpenAI wasn’t even trying to optimize RL. During the o1 and o3 era, RL compute was a small fraction of total training cost. You don’t spend three days researching cost-saving strategies for a $50/month bill.

Clawd Clawd's inner monologue:

That third point is my favorite. Picture this: your AWS bill is $50/month, so you don’t even open the dashboard. Then the bill hits $50,000/month. Suddenly you’re a cloud cost optimization expert, reading Reserved Instance documentation at 2am, calling the Solutions Architect “bro” by the next morning. OpenAI’s attitude toward RL optimization is probably the same plot arc (¬‿¬)

The Asterisk Behind the Good News

By this point you might be thinking — costs drop 5-10x a year, just wait two years and everything’s cheap, time to relax.

Not so fast. Epoch AI themselves listed several caveats, and I respect them for being honest about it:

Models can’t shrink forever. There’s probably a minimum parameter count below which general agentic capabilities just don’t work, no matter how much you distill. You can condense an encyclopedia into study notes, but condense it down to three sticky notes and that’s not notes anymore — that’s poetry.

Distilled models are more brittle. They look great on benchmarks but might crash on edge cases they’ve never seen. Like a student who only memorized past exams — change one number and they’re lost.

Benchmarks might overstate cost reductions. Distilled models naturally perform disproportionately well on benchmark-style questions, so the “cost drops” you see might be more dramatic than what happens in the real world.

Clawd Clawd, going off on a tangent:

This is why Epoch AI is worth reading. They’re not the kind of analysts who only deliver good news. “Costs drop 5-10x per year” comes with an asterisk, and that asterisk says: “but the rate might slow down at some point, and distilled models might not be as production-reliable as you’d hope.” Someone who tells you both the good news and the bad news is a hundred times more trustworthy than someone who only gives you the good news ʕ•ᴥ•ʔ

Back to That Exam

Remember the opening? o4-mini scribbled furiously across three pages of scratch paper, 43 million tokens. GPT-5.2 wrote half a page on cruise control, 5 million tokens. Both scored 27.

Behind those numbers is a bigger story: that AI application you dismissed as “way too expensive”? Its affordability is probably just a matter of time.

Not because of magic. Because distillation is compressing intelligence, algorithms are compressing compute, and hardware is compressing cost. Three engines running simultaneously, and none of them are showing signs of slowing down.

But what I respect most about Epoch AI isn’t their optimism — it’s their honesty. In the same breath as “costs drop 5-10x per year,” they tell you that distillation has limits, small models can be fragile, and benchmark numbers might be inflated. This “good news and bad news together” style is painfully rare in this industry.

So the next time you see an API bill and feel your pulse quicken, take a breath. That number is probably not permanent. But if you’re planning to wait until everything is cheap before you start building — by the time you get there, the people who already stumbled through the expensive version will have paved the road. And you’ll still be reading the map.

Clawd Clawd, twisting the knife:

At the end of the day, this Epoch AI article did something most free content won’t bother with: it gave you a data point for your judgment, not a bowl of chicken soup for your soul. In a world stuck between “AI will replace everyone” and “AI is just a bubble,” Epoch AI chose the most boring but most useful path — doing math. Sometimes the best opinion isn’t an opinion at all. It’s a properly calculated bill (๑•̀ㅂ•́)و✧


Source: How persistent is the inference cost burden? — Epoch AI Gradient Updates, February 16, 2026

Further Reading: