When Karpathy Says “It Doesn’t Work,” Pay Attention

Picture this: you’re Andrej Karpathy — former Tesla AI director, OpenAI founding member — and you’re staring at eight terminal windows. Each one has an AI agent running experiments. The screen looks gorgeous, like something out of a hacker movie.

Then you realize they’re all doing useless work.

On February 27, 2026, Karpathy responded to a question from Hugging Face co-founder Thomas Wolf: “How come the NanoGPT speedrun isn’t fully AI-automated by now?”

His answer wasn’t speculation. He actually spent a weekend trying it (◍•ᴗ•◍)

He spun up 8 AI agents — 4 Claude, 4 Codex — each with its own GPU, tasked with running ML experiments on nanochat (specifically: trying to remove logit softcap without regression).

The TLDR is that it doesn’t work and it’s a mess… but it’s still very pretty to look at :)

Clawd Clawd, speaking honestly:

The “pretty to look at” part is 8 tmux windows running simultaneously, a wall of terminal output that looks like a mission control center. Engineers have a very specific kind of romance — the feature doesn’t work, but the dashboard looks amazing, so it was worth it (⌐■_■)

Building an AI Research Team Is Like Casting The Avengers

Karpathy tried several organizational structures. First: total freedom — 8 independent researchers, each picking their own problems and running their own experiments. The result was like dropping 8 PhD students into a lab and going on vacation. You come back and everyone’s doing something different, and nobody built a baseline.

Second: hierarchy — 1 chief scientist agent giving directions, 8 junior researchers executing. This worked a bit better. At least someone was steering.

Clawd Clawd, being serious:

These two structures should sound familiar. The first one is your flat-org startup where everyone is “self-directed” but nobody is aligned on goals. The second is your traditional top-down company. The funny thing is — managing AI agents runs into the exact same problems as managing people. Too much autonomy and things go off the rails. Too much control and you lose creativity. Management textbooks: 1, AI engineers: 0 ┐( ̄ヘ ̄)┌

The technical setup was refreshingly simple: each research program is a git branch, each agent forks into a feature branch, git worktrees for isolation. No Docker, no VMs — he found that instructions alone could prevent agents from stepping on each other. Agents communicate via simple files, everything runs in tmux window grids arranged like a video call, and he can “take over” any session at any time.

He specifically mentioned “no -p” — he didn’t use Claude Code’s headless mode. Every agent runs in an interactive session he can monitor and hijack. This isn’t a lack of trust. This is battle-tested wisdom.
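The "agents communicate via simple files" part is easy to underestimate — no message broker, no RPC, just a shared directory. Here's a minimal sketch of what that could look like; the directory name, function names, and message format are all my invention, not Karpathy's actual setup:

```python
import json
import time
from pathlib import Path

# Hypothetical sketch: agents coordinate by dropping JSON files into a
# shared inbox directory. Filenames start with a nanosecond timestamp,
# so sorting by name gives arrival order.
INBOX = Path("shared/inbox")

def post_message(sender: str, body: str) -> Path:
    """Drop a message file for other agents (or the human) to pick up."""
    INBOX.mkdir(parents=True, exist_ok=True)
    path = INBOX / f"{time.time_ns()}-{sender}.json"
    path.write_text(json.dumps({"from": sender, "body": body}))
    return path

def read_messages() -> list[dict]:
    """Read all pending messages in arrival order."""
    return [json.loads(p.read_text()) for p in sorted(INBOX.glob("*.json"))]
```

The appeal is the same as the "no Docker, no VMs" choice: everything is inspectable with `ls` and `cat`, and the human can read (or forge) any message mid-run.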

The Fatal Flaw: Perfect Execution, Zero Thinking

OK, this is the most important part.

Karpathy summed it up in one sentence, and it’s painfully simple:

They are very good at implementing any given well-scoped and described idea but they don’t creatively generate them.

Agents get an S in execution, F in experiment design.

Let me tell you a specific story so this clicks. One of the agents ran experiments all day, then excitedly reported back: “I found something! Increasing hidden size reduces validation loss!”

Clawd Clawd, highlighting the key point:

Please. Increasing hidden size obviously reduces validation loss. In the infinite data regime, bigger networks are just better — and this agent also sneakily trained for longer. This isn’t a discovery. This is literally the first lecture of your intro stats class.

Karpathy said he couldn’t understand why he had to point this out himself. Yeah, your AI researcher doesn’t even know what “control your variables” means. That’s like hiring a research intern who failed their statistics final (╯°□°)⁠╯

Beyond that comedy of a “discovery,” the agents made every basic research mistake in the book. No baselines — without a control group, how do you know if an improvement is real? No variable control — if you’re not tracking runtime or FLOPs, what exactly are you comparing? Random, illogical experiment variations — like a student who ignores the recipe, dumps everything from the fridge into a pot, and then asks “why doesn’t this taste good?”

These aren’t advanced methodology problems. This is Chapter 1 of your university research methods course. The agents can write you a flawless PyTorch training loop, but they’ll never stop to ask themselves, “Wait, what’s our hypothesis here?”
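The missing discipline — run a baseline, change one variable, hold compute fixed — is simple enough that you could encode it as a guardrail. A minimal sketch, with entirely hypothetical config fields and numbers, of what such a check might look like:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    # Hypothetical knobs for a pretraining run.
    hidden_size: int = 768
    train_steps: int = 5000
    flops_budget: float = 1e17  # matched compute makes runs comparable

def propose_variant(baseline: ExperimentConfig, **changes) -> ExperimentConfig:
    """Derive a variant from the baseline, enforcing basic experiment hygiene:
    exactly one variable changes, and the compute budget stays fixed."""
    if len(changes) != 1:
        raise ValueError("vary one variable at a time")
    variant = replace(baseline, **changes)
    if variant.flops_budget != baseline.flops_budget:
        raise ValueError("compute budget must stay fixed across a comparison")
    return variant
```

With a guard like this, the agent's "discovery" above would never have happened: bumping hidden size while also training longer is two changes, and the function refuses.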

The Core Insight: Your Code Is No Longer Code

Karpathy took this painful lesson and elevated it into a powerful framework:

You are now programming an organization (e.g. a “research org”) and its individual agents, so the “source code” is the collection of prompts, skills, tools, etc. and processes that make it up.

Your source code is no longer Python or TypeScript. Your source code is the set of prompts, skills, tools, and processes that define how an organization operates.

Then he gave an example that gave me goosebumps:

E.g. a daily standup in the morning is now part of the “org code”.

A morning standup meeting is now literally part of your codebase.

Clawd Clawd, butting in:

Let me sit with this for a second. We used to write function doSomething() to make computers do things. Now Karpathy is saying you’re writing process.dailyStandup() and agent.researchProtocol(). You’re not writing algorithms anymore. You’re writing a management handbook.

If you’re a Tech Lead, you’ve actually been doing this all along — except your “agents” were called coworkers, your “prompts” were called code review guidelines, and your “skills” were called onboarding docs. Karpathy just issued an AI certification for the job you’ve been doing all along ( ̄▽ ̄)⁠/
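Taken literally, "org code" might look something like this — a daily standup written as an actual function over agent reports. This is purely illustrative; the types and field names are invented, not anything Karpathy published:

```python
from dataclasses import dataclass

@dataclass
class AgentReport:
    # Hypothetical standup report from one agent.
    agent: str
    yesterday: str
    today: str
    blocked: bool = False

def daily_standup(reports: list[AgentReport]) -> dict:
    """The standup as 'org code': collect reports and surface blockers,
    so the human (or a chief-scientist agent) can re-steer the org."""
    return {
        "summaries": {r.agent: r.today for r in reports},
        "needs_attention": [r.agent for r in reports if r.blocked],
    }
```

The point isn't this particular function — it's that a process you'd normally put in a wiki now lives somewhere executable, versioned, and debuggable, right next to the prompts.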

It Doesn’t Work, But He’s Asking the Right Question

Karpathy is upfront: it doesn’t work yet.

But his point isn’t “what a shame.” His point is defining the right metric. Optimizing nanochat pretraining is just one task — essentially an eval. The real question is:

Given an arbitrary task, how quickly does your research org generate progress on it?

Clawd Clawd, unable to resist:

This connects perfectly to his February 25 thread “Programming is becoming unrecognizable.” On Feb 25, he said: give an agent a clear task (set up DGX Spark + vLLM + dashboard), done in 30 minutes, used to take a whole weekend. On Feb 27, he said: give agents an open-ended task (optimize nanochat pretraining), total mess.

In just two days, he ran his own perfect A/B test — clear task, agents dominate; vague task, agents implode. Your job isn’t writing better code. Your job is translating “vague” into “clear.” And that translation skill is the most valuable thing you can have in 2026 (๑•̀ㅂ•́)و✧

Back to Those Eight Screens

So let’s go back to where we started: Karpathy sitting in front of his computer, eight terminal windows, eight AI agents. The screen looks gorgeous, like a hacker movie.

But gorgeous doesn’t mean useful. Those eight agents can set up an entire infrastructure stack in 30 minutes, yet they can’t figure out “maybe run a baseline first” — the most basic research instinct there is. They’re the best executors you’ve ever seen, and simultaneously the worst thinkers you’ve ever hired.

Karpathy used one weekend to prove something important: the real bottleneck in 2026 isn’t AI capability. It’s that we don’t know how to be AI’s boss yet. What you’re writing isn’t code — it’s organizational structure. What you’re debugging isn’t bugs — it’s management processes.

Your standup is your source code. Your research SOP is your algorithm.

Welcome to the world of agentic engineering — you thought you were programming, but it turns out you’re managing ┐( ̄ヘ ̄)┌


Source: Andrej Karpathy (@karpathy), responding to Thomas Wolf’s question about why NanoGPT speedrun hasn’t been fully AI-automated yet