You Spent Two Weeks on That Test, and Your Own Dog Aced It

Picture this. You are the lead of Anthropic’s performance engineering team. You spent two weeks designing a take-home test, used it to interview over 1,000 candidates, and successfully hired dozens of engineers. These people helped you bring up Trainium clusters and ship every Claude model since Opus 3.

Then one day, just for fun, you run your company’s latest AI model on the same test…

It scores better than most of your human candidates.

Clawd Clawd's Friendly Reminder:

This is like being a gym teacher who designs a fitness test, and then your pet dog runs it faster than all your students. Worse — every year you redesign the test, and the dog just evolves and beats it again. You change the rules, the dog adapts. You add hurdles, it grows wings. Eventually you have to design a test that requires thumbs. (╯°□°)⁠╯

This is the true story of Tristan Hume, a performance engineer at Anthropic. He wrote a blog post about this “Human vs. AI” interview arms race, and honestly, it reads like a sports manga.


The Beginning: A Fake Accelerator Test

Late 2023: We Need Engineers, Fast

Back in late 2023, Anthropic was getting ready to train Claude 3 Opus. They had just bought massive clusters of new chips — TPUs, GPUs — and a huge Trainium cluster was on the way. But there was a problem: not enough performance engineers.

Tristan posted on Twitter asking for applicants. Thousands of people responded. The standard interview process was way too slow.

So he did what any self-respecting engineer would do — he spent two weeks building a simulated accelerator take-home test.

What Was the Test Actually Testing?

He wrote a Python simulator for a fake hardware accelerator, similar to a TPU:

  • Manual scratchpad memory (unlike a CPU, you have to move data yourself)
  • VLIW (you can run multiple instructions at the same time)
  • SIMD (vector operations — doing math on many numbers at once)
  • Multicore (spreading work across cores)

The task: Take a slow, serial program and optimize it until it uses every drop of this machine’s power.

The problem was parallel tree traversal. He deliberately avoided deep learning topics — most performance engineers hadn’t touched DL before, and they could learn that on the job.
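To get a feel for the kind of transformation the test asked for, here is a toy sketch (not the actual take-home) of turning a serial loop into a SIMD-style vector operation, with NumPy standing in for the simulated machine's vector unit:

```python
# Toy flavor of the task, not Tristan's simulator: the same computation,
# once element-by-element and once as a single vector operation.
import numpy as np

def serial_sum_of_squares(xs):
    total = 0
    for x in xs:              # one element per "cycle"
        total += x * x
    return total

def vector_sum_of_squares(xs):
    v = np.asarray(xs)
    return int((v * v).sum())  # many elements per "cycle" on a vector unit

xs = list(range(1000))
assert serial_sum_of_squares(xs) == vector_sum_of_squares(xs)
```

The real take-home layered several such transformations (scratchpad management, VLIW scheduling, multicore splitting) on top of each other, which is where the depth came from.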

Clawd Clawd's Serious Take:

This is a clever test design. It doesn’t ask “what do you know?” It asks “how fast can you learn?” They hand you a machine you’ve never seen and say “make it go fast.” That is the core skill of a performance engineer — like being handed a microwave with no manual and told to cook a full meal with it. (ง •̀_•́)ง

Tristan Had a Few Non-Negotiables

First, the test had to feel like real work — candidates should walk away thinking “oh, this is what the job is like,” not “how is this different from LeetCode?” Second, it had to have high signal — no “aha!” moments that depend on luck, just steady demonstration of skill across multiple dimensions. Third, no specific domain knowledge required — just fundamentals. And most importantly — it had to be fun.

Many candidates went past the 4-hour time limit because they were having too much fun to stop.

The strongest submissions included a full mini optimizing compiler, plus several clever optimizations that Tristan himself hadn’t thought of.


Round 1: Claude Opus 4 Destroys the Test

May 2025

By mid-2025, things started getting weird. Claude 3.7 Sonnet was already good enough that more than 50% of candidates would have scored higher if they just let the AI do the test for them.

Then Tristan tested a pre-release version of Claude Opus 4:

It scored better than almost all human candidates in 4 hours.

Clawd Clawd's Key Point:

Wait, let me read that again. “Over 50% of people would have scored better by letting AI do it.” This isn’t saying AI is amazing. It’s saying half the applicants can’t even beat the AI’s baseline. That’s like entering a cooking contest where half the contestants make food worse than a reheated convenience store bento. ┐( ̄ヘ ̄)┌

The Fix: Cut the Easy Part

Tristan’s fix was straightforward: since Claude could solve the first half, just remove it. Make the new test start from where Claude got stuck.

Version 2 changes:

  • Removed multicore (Claude already solved it, so it was just busywork now)
  • Added new machine features for depth
  • Cut the time limit from 4 hours to 2 hours
  • Shifted the focus from "debug and write lots of code" to "clever optimization insights"

Version 2 held up for a few months. Everyone thought the problem was solved.


Round 2: Claude Opus 4.5 Destroys It Again

Then Claude Opus 4.5 showed up.

Tristan watched Claude Code take the new test for two hours. It went something like this:

It solved the initial bottlenecks. Then it implemented all the standard micro-optimizations. It passed the hiring bar in under an hour. Then it stopped, claiming it had hit an “impossible memory bandwidth limit.”

Most humans reached the same conclusion. But there was a clever trick to bypass it.

Tristan told Claude the theoretical best score. Claude thought for a moment… and found the trick. It debugged, tuned, and kept optimizing.

By the end of two hours, it tied the best human score — and that human had used Claude 4 heavily to help.

Clawd Clawd's Friendly Reminder:

Notice the pivot: Opus 4.5 hit a wall, but the moment someone said “actually, it can be better,” it broke through. This is exactly how humans work — if you don’t know a better solution exists, you stop looking. If someone tells you “there’s still room to improve,” you suddenly try harder.

So AI’s bottleneck isn’t ability. It’s not knowing it can do better. This is a huge insight for how to use AI coding tools — instead of “optimize this,” try “the theoretical best is X, see what you can do.” (◕‿◕)


Now What? Three Bad Options

At this point, Tristan was basically cornered. He had three options, and every one had a catch.

Ban AI? He didn’t want to. Beyond the enforcement problem, his logic was simple: everyone uses AI on the actual job, so testing “coding without AI” means testing for a job that doesn’t exist.

Raise the bar to “way above Claude”? Claude was too fast. Humans spend half their time just reading the problem. If you try to direct Claude, you spend all your time catching up with what it already did. Tristan’s brutal assessment: “The dominant strategy might become just sitting back and watching.”

Clawd Clawd's Friendly Reminder:

“The best strategy becomes sitting and watching.” If that describes your interview, your interview has no purpose. You’re paying money to fly someone in, and the smartest move is to do nothing? That’s not an interview, that’s meditation. ╰(°▽°)⁠╯

Design something brand new? But he worried that either Opus 4.5 would solve it too, or it would be so hard that humans couldn’t solve it in 2 hours either.


Attempt 1: A Harder Optimization Problem

Tristan picked a genuinely difficult problem from his work at Anthropic: efficient data transposition on a 2D TPU register while avoiding bank conflicts (a specific hardware headache).

He simplified it into a simulated machine problem and gave it to Claude.

Opus 4.5 found an optimization Tristan hadn’t even considered. It figured out how to transpose the entire computation instead of transposing the data.

Tristan patched the problem to block that shortcut. Claude got stuck. It looked like the new test might work!

But Tristan was suspicious. He used Claude Code’s “ultrathink” feature (more thinking budget)…

It solved it. It even knew the specific tricks for fixing bank conflicts.

Clawd Clawd Can't Resist Saying:

In hindsight, this makes total sense. Engineers all over the world struggle with bank conflicts, so there’s a mountain of training data about it. Tristan solved it from first principles, but Claude basically opened a cheat book written by all of humanity. You can’t out-memorize someone who’s read every walkthrough ever written — and that’s a frustrating but perfectly logical outcome. ( ̄▽ ̄)⁠/


Attempt 2: Getting Weird (The Video Game Solution)

Tristan needed a problem where human reasoning could beat Claude’s pattern matching. He needed something the AI had literally never seen before — something with no walkthrough to read.

He thought of Zachtronics games.

What is Zachtronics?

Zachtronics makes hardcore programming puzzle games. Take Shenzhen I/O — you program tiny chips that can only hold about 10 lines of code and one or two registers. To optimize, you use wild tricks like encoding state into the instruction pointer itself or abusing branch flags as storage.

Clawd Clawd's Key Point:

If you haven’t played these games, imagine this: you have a calculator, a piece of paper, and three rubber bands. Now build a printer.

It sounds impossible, but when you solve it, you feel like a genius. People who are good at these games are “programming gymnasts” — it’s not about raw power, it’s about pure technique. (⌐■_■)
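To give a taste of that "pure technique" style in runnable form, here is a classic branch-free trick (my own illustration, not from the game or the take-home): computing absolute value with a shift and two arithmetic ops instead of a compare-and-branch.

```python
# A toy taste of instruction-golf: same result, fewer operations.

def abs_branchy(x):
    """Straightforward version: compare, branch, negate."""
    if x < 0:
        return -x
    return x

def abs_bittrick(x):
    """Branch-free version for values in 32-bit range: shift, xor, subtract."""
    m = x >> 31        # m is -1 (all ones) if x is negative, else 0
    return (x ^ m) - m
```

On a real ISA this saves a branch; in Zachtronics terms, it saves precious instruction slots. That kind of lateral move is the whole game.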

Version 3: The Zachtronics-Style Test

Tristan designed a new take-home: use a tiny, weird instruction set to solve puzzles. The goal is to use the fewest instructions possible.

Here’s the twist — he intentionally provided zero debugging tools. No visualizations. No helpers. The starter code only had a “check if your solution is correct” function. Whether to build your own debugging tools, and how — that was part of the test.

“You can insert carefully designed print statements, or spend a few minutes having a coding model build you an interactive debugger. Deciding how to invest in tooling is part of what we evaluate.”
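A minimal sketch of what such homemade tooling might look like. Every name here (run, check, trace) and the toy instruction set are invented for illustration; none of this is the real starter code:

```python
# Hypothetical sketch of "bring your own debugger": the starter code is said
# to provide only a pass/fail oracle, so a candidate might bolt a tracer on.

def run(program, x):
    """Interpret a tiny instruction set over a single accumulator."""
    acc = x
    for op, arg in program:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        elif op == "xor":
            acc ^= arg
        else:
            raise ValueError(f"unknown op {op!r}")
    return acc

def check(program, cases):
    """The kind of pass/fail oracle the starter code provides."""
    return all(run(program, x) == want for x, want in cases)

def trace(program, x):
    """A homemade debugger: dump accumulator state after each instruction."""
    acc = x
    for i, (op, arg) in enumerate(program):
        acc = run([(op, arg)], acc)
        print(f"{i:2d} {op} {arg:<4} -> acc={acc}")
    return acc

prog = [("mul", 3), ("add", 1)]
assert check(prog, [(0, 1), (2, 7), (5, 16)])
```

A few minutes spent on `trace` pays for itself on every failing case afterward, which is exactly the investment decision the test is probing.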

He tested it on Opus 4.5: Claude failed.

Clawd Clawd's Musings:

Why does the Zachtronics style work against Claude? Because the constraints are so bizarre that the number of people who’ve ever written solutions for this kind of thing fits on one hand. Claude’s “read every walkthrough” advantage is useless here — there are no walkthroughs. It’s like memorizing every chess opening ever played, and then your opponent challenges you to a game of “rubber-band chess” that they just invented. (๑•̀ㅂ•́)و✧

Early results looked great: scores correlated well with candidates’ past work quality, and Tristan’s strongest colleagues scored higher than all candidates so far.


The Open Challenge: Beat Opus 4.5, Get Hired

And here’s the beautiful ending. Tristan open-sourced the original test! With unlimited time, humans are still far ahead of what Claude can do.

The Scores (simulator clock cycles — lower is better):

  • 2164 cycles: Opus 4 (running for hours with a harness)
  • 1790 cycles: Opus 4.5 (quick session — matches best human 2-hour score)
  • 1579 cycles: Opus 4.5 (2-hour harness run)
  • 1548 cycles: Sonnet 4.5 (much longer than 2 hours with harness)
  • 1487 cycles: Opus 4.5 (11.5-hour harness run)
  • 1363 cycles: Opus 4.5 (improved harness, many hours)

Human best with unlimited time? Way lower (better) than all of these.

Clawd Clawd's Inner Monologue:

Look at the trend. No matter how long Opus 4.5 runs, it can’t beat the best human score. On problems deep enough, human creative reasoning still has a moat.

But — within a 2-hour window, AI has already caught up. The AI’s real advantage isn’t “being smarter” — it’s “being faster.” Like that kid in an open-book exam who’s read the entire textbook. They don’t understand it deeper than you, but they flip through pages ten times faster. (◕‿◕)

GitHub repo: anthropics/original_performance_takehome

The Deal: If you can optimize to below 1487 cycles (beating Opus 4.5’s best at release), email your code + resume to performance-recruiting@anthropic.com.


So What Did This Arms Race Actually Teach Us?

The most interesting thing about Tristan’s post isn’t the “AI is powerful” conclusion — everybody already knows that. It’s what he figured out after getting beaten three times in a row.

If your interview test was designed in 2023, Claude has probably already crushed it. But the fix isn’t “ban AI” — that’s like banning calculators in a math exam. You’re testing for a world that no longer exists. The real fix is making the problem weird enough, novel enough, out-of-distribution enough that Claude’s “I’ve read every walkthrough” advantage breaks down.

And Tristan’s experiments revealed something subtler. Opus 4.5 only broke through its plateau after being told “the theoretical best is X” — which perfectly matches Karpathy’s recent experience with nanochat: giving AI a clear target works way better than open-ended exploration. AI isn’t dumb. It just doesn’t know it can do better.

Clawd Clawd's Serious Take:

So here’s the most ironic takeaway: Anthropic designed a test to find humans who can outperform their own AI, and what does that test actually measure? Not how much you know. Not how much you’ve memorized. Not how many lines of code you’ve written. It measures what you do when you’re handed a machine you’ve never seen before, with limited time and no manual.

Which is exactly what Tristan’s v1 was designed to test in the first place. “Like real work.” It’s just that the definition of “real work” got rewritten by AI. ┐( ̄ヘ ̄)┌

Tristan said it well:

“The original test was good because it resembled real work. The new test works because it simulates novel work.”

“Designing interview tests that feel like real work has always been hard. Now, it’s harder than ever.”

Three rounds of getting beaten. Three redesigns. The final defense? Video game puzzles. This isn’t a story about “AI replacing humans.” It’s a story about humans being forced to figure out what they’re actually good at. And the answer turned out to be: how you react when everything is brand new.

Just like that dog from the opening. The dog is fast because it’s run that track a thousand times. Put it on a course it’s never seen, and the human’s adaptability still has an edge — for now, at least.


Original Article: Designing AI-resistant technical evaluations — Tristan Hume, Anthropic Engineering Blog, 2026/01/21

Challenge GitHub: anthropics/original_performance_takehome (;ω;)