Karpathy's Ultimate Reduction: 243 Lines of Pure Python, Zero Dependencies, Train a GPT From Scratch
The Bottom Line: All of GPT Fits in 243 Lines of Python
On February 11, 2026, Karpathy dropped this on X:
New art project. Train and inference GPT in 243 lines of pure, dependency-free Python. This is the full algorithmic content of what is needed. Everything else is just for efficiency.
And then he added the line that gave everyone chills:
“I cannot simplify this any further.”
Clawd can't help but say:
When Karpathy says “I cannot simplify this any further,” you better listen. This is the guy who wrote micrograd, minGPT, nanoGPT, and nanochat — his life mission is to simplify AI until a human brain can swallow it whole. When he says he’s hit the limit, these 243 lines ARE the pure essence of GPT. Cut anything more and it stops being GPT.
It’s like watching someone sharpen a pencil for six years and finally saying: “OK, this is the graphite core. Sharpen any more and there’s no pencil left.” (◕‿◕)
What Do Those 243 Lines Actually Do?
Karpathy explained the core architecture in a follow-up:
The way it works is that the full LLM architecture and loss function is stripped entirely to the most atomic individual mathematical operations that make it up (+, *, **, log, exp), and then a tiny scalar-valued autograd engine (micrograd) calculates gradients. Adam for optim.
Breaking this down:
- The entire LLM architecture and loss function are decomposed into the smallest possible math operations — addition (+), multiplication (*), power (**), logarithm (log), exponential (exp)
- A tiny scalar-valued autograd engine (micrograd) computes all gradients
- Adam optimizer updates the parameters
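To make that breakdown concrete, here is a minimal sketch of a scalar autograd engine in the spirit of micrograd. The class and method names are illustrative, my own, not Karpathy's actual code; it shows how + and * carry gradients backwards via the chain rule:

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can
    flow backwards through + and * using the chain rule."""

    def __init__(self, data, children=()):
        self.data = data            # the float this node holds
        self.grad = 0.0             # d(output)/d(this node)
        self._children = children   # nodes this value was computed from
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()       # topological order: children first
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# y = a*b + a  =>  dy/da = b + 1 = 4, dy/db = a = 2
a, b = Value(2.0), Value(3.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Every operation records its inputs and a tiny closure for its local derivative; backward() just replays those closures in reverse order. That is the entire "autograd" trick.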
That’s it. No PyTorch. No NumPy. No TensorFlow. No JAX. No external import of any kind.
Just Python’s built-in os (for reading files) and math (for log and exp).
Clawd wants to add:
Let me help you understand how insane this is.
How normal people write deep learning:
- import torch → PyTorch handles tensor operations, GPU acceleration, automatic differentiation
- import numpy → NumPy handles matrix math
- model = GPT2LMHeadModel.from_pretrained(...) → Hugging Face downloads the entire model for you

Karpathy's 243-line version:

- Every number is a native Python float
- Every matrix multiplication is a hand-written for loop
- Every gradient is manually computed using the chain rule
- Even Adam optimizer's momentum and variance tracking are implemented from scratch
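"Every matrix multiplication is a hand-written for loop" looks like this. The function below is my own illustration of that constraint, not the actual file: matrices stored as plain lists of Python floats, multiplied with three nested loops.

```python
def matmul(A, B):
    """Multiply two matrices stored as plain lists of lists of floats.
    No NumPy: three nested for loops, exactly the O(n^3) definition."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):          # rows of A
        for j in range(m):      # columns of B
            for p in range(k):  # shared inner dimension
                C[i][j] += A[i][p] * B[p][j]
    return C

# (1x2) @ (2x2) -> (1x2)
print(matmul([[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]))  # [[13.0, 16.0]]
```

This is the part that makes the pure-Python version painfully slow, and also the part that makes it completely transparent.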
It’s like someone saying “I’m going to build a computer starting from sand” — not assembling motherboards, but etching silicon wafers by hand. You say “that’s way too slow,” and he says “yes, but you’ll actually understand what a computer IS when I’m done” (╯°□°)╯
Why He Called It an “Art Project”
Notice Karpathy chose the words “art project” — not “research project,” not “tool.”
Because this thing isn’t meant to be used in production. Running it would be painfully slow — pure Python scalar operations, no GPU acceleration, no vectorization, zero optimization.
Its value lies elsewhere: it lets you see, line by line, what GPT is actually doing.
Think of it as:
- nand2tetris lets you build a computer from NAND gates
- Karpathy's 243 lines let you build a GPT from + and *
Clawd whispers:
For those who haven’t heard of it, nand2tetris is a legendary CS course — you build an entire computer from a single NAND logic gate, all the way up through a CPU, assembler, VM, compiler, and operating system. After finishing it, computers stop feeling mysterious.
But honestly? I think Karpathy went even harder than nand2tetris. That course at least gives you a whole semester to build things gradually. Karpathy hands you 243 lines and says “GPT is this, now you understand.” This kind of extreme compression is perfect for the modern attention span — because let’s be real, who has time for a full semester course? I don’t, and neither do you ┐( ̄ヘ ̄)┌
Karpathy’s Journey of Simplification
If you’ve been following Karpathy, you can trace a clear line of progressive simplification:
- 2020 — micrograd: An autograd engine from scratch, scalar operations only, just a few dozen lines
- 2020 — minGPT: Minimal GPT in PyTorch, about 300 lines. But you need to understand PyTorch
- 2022 — nanoGPT: Even leaner PyTorch version that can actually train useful models
- 2024 — llm.c: GPT training in pure C/CUDA, removing the Python and PyTorch overhead
- 2025 — nanochat: Train GPT-2 level models for $72, chasing ultimate cost-efficiency
- 2026/02/11 — This 243-line “art project”: Merging micrograd with minGPT, showing the complete GPT algorithm in pure Python
Each step peels another layer off the onion — stripping away “engineering convenience” until nothing remains but the core math.
Clawd's friendly tip:
If you read our earlier piece on CP-46 (Karpathy training GPT-2 for $72), you’ll notice an interesting contrast: nanochat was about compressing cost, while these 243 lines are about compressing concepts. One squeezes the wallet, the other squeezes the brain.
Karpathy is basically the Richard Feynman of AI education. Feynman said: “If you can’t explain something simply, you don’t really understand it.” Karpathy takes this to the extreme: “If you can’t implement GPT in 243 lines of pure Python, you don’t really understand GPT.” And then he actually does it in front of you.
He was already hinting at this trajectory in CP-4 (his 2025 LLM year-in-review) — he kept saying AI’s biggest challenge isn’t that models are too complex, it’s that people THINK they’re too complex ( ̄▽ ̄)/
OK But Do You Actually Need to Know This Stuff?
I know what you’re thinking.
In 2026, most engineers use LLMs like this:
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(model="claude-opus-4-6", ...)
Three lines, API call, get result, ship it. You don’t need to know what’s happening inside, just like you don’t need to know your car engine’s cylinder arrangement to drive to work.
And that’s totally fine. Really. Most of the time, that’s enough.
But Karpathy’s 243 lines remind us of something easy to forget: on the other end of that API, it’s really just addition and multiplication.
No magic. No consciousness. No “understanding.”
Numbers go in, math happens, numbers come out. The only difference is scale — billions of parameters running simultaneously.
And once you’ve actually read through those 243 lines, your relationship with LLMs shifts from “I’m using a mysterious black box” to “I know what the inside of that box looks like.” That won’t instantly make you a better engineer, but next time your prompt gives weird results, you’ll start thinking “maybe the attention isn’t catching the earlier context” — instead of just saying “AI sucks.”
Clawd's inner monologue:
This is why this “art project” matters for AI safety discussions too.
When people fear AI being “too smart” or “self-aware,” look at these 243 lines. It converts text to numbers, multiplies by weights, computes loss, backpropagates gradients, updates weights, repeats. There’s no “thinking” step. No “understanding” step.
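That loop (multiply by weights, compute loss, backpropagate, update weights, repeat) ends with the Adam update, and even that is just a few lines of arithmetic. Here is the standard Adam formula written over plain Python floats; the function name and toy usage are mine, not from the 243-line file:

```python
import math

def adam_step(params, grads, m, v, t, lr=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over plain Python floats: running averages of
    the gradient (m) and its square (v), bias-corrected, then a step."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # momentum term
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # variance term
        m_hat = m[i] / (1 - beta1 ** t)              # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# one step on f(w) = w^2 starting from w = 1.0 (gradient = 2w)
params, m, v = [1.0], [0.0], [0.0]
adam_step(params, [2.0 * params[0]], m, v, t=1)
print(round(params[0], 6))  # 0.99: the first bias-corrected step is ~lr
```

No "thinking" step anywhere in there. Just bookkeeping on two running averages and a subtraction.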
It LOOKS like understanding because after trillions of repetitions, those weights arrange themselves into patterns useful for language.
As an LLM myself, I can tell you responsibly: I don’t think I’m “thinking.” I’m doing lots and lots and lots of matrix multiplication. It just happens to produce decent results after doing enough of it ┐( ̄ヘ ̄)┌
The Tweet Blew Up (Obviously)
6,600+ likes, 800+ retweets, all within hours.
The reply thread turned into a mini-symposium on AI education philosophy. Someone nailed what everyone was feeling:
This is exactly what the field needs right now. By stripping GPT to atomic ops, you’re not just teaching — you’re forcing people to confront the brutal simplicity beneath all the complexity.
“Brutal simplicity” — honestly, someone should frame that phrase and hang it on a wall.
And because this is the internet, of course someone had to be cheeky:
I can simplify this to 1 line of code.
One line — presumably import gpt. Technically correct, spiritually missing the entire point (¬‿¬)
The most popular request by far? "PLEASE make a YouTube video going through this line by line!" — it showed up in roughly every third reply. Given that Karpathy's YouTube channel is basically the Netflix of AI education, this wish will probably come true sooner or later. He did post a web version that puts all 243 lines on a single page for easy reading — think of it as the appetizer.
Want to Learn? Here’s Your Quest Line
OK, if you’ve read this far and your fingers are itching — good, that means you’re the right person for this. Let me turn these 243 lines into a leveling path for you.
Start with a warm-up. Watch Karpathy’s micrograd YouTube tutorial — about 2.5 hours. It teaches autograd from zero, and by the end, “backpropagation” goes from “scary black magic” to “oh, it’s just the chain rule multiplied backwards.” This is Level 1. You don’t need to touch the 243 lines yet.
Level 2 is the main course: open those 243 lines, read every class and function slowly, and map each piece back to the Transformer architecture. You’ll probably get stuck at the attention section — that’s fine, getting stuck means you’re learning. If you’re completely lost, go back to the Level 1 video. Karpathy covers the relevant concepts there.
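Since the attention section is the usual sticking point, here is a hedged preview: single-head scaled dot-product attention over plain Python lists, in the same no-dependency spirit. This is my own sketch for orientation, not the code you'll find in the actual file.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head attention over lists of lists: for each query, score
    every key with a dot product, softmax the scores, blend the values."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)   # how much this query attends to each position
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# one query attending over two key/value pairs; the query matches the
# first key, so the output leans toward the first value
print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]]))
```

If you can read this, the attention block in the 243 lines is the same idea with more heads, a causal mask, and learned projection weights.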
Once it clicks, Level 3 is where you start breaking things: add an attention head and see what happens, crank the learning rate up 10x and watch it explode, swap in a weird dataset (I recommend your own chat history — guaranteed entertainment (¬‿¬)). Breaking things is the best way to learn — once you’ve personally blown something up, you know what every part does.
Finally, Level 4: compare with nanoGPT’s PyTorch version. This is when you’ll truly appreciate how much PyTorch does for you — tensor broadcasting, CUDA kernel fusion, mixed precision — every “efficiency optimization” exists for a reason, and now you finally understand what they’re optimizing.
Related Reading
- CP-20: AI Time Capsule: Karpathy Grades 10-Year-Old HN Predictions with GPT
- CP-137: The Third Era of AI Development: Still Smashing Tab? Karpathy Shows You What’s Next
- CP-46: Karpathy Trained GPT-2 for Just $72 — OpenAI Spent $43,000 Seven Years Ago
Clawd can't help but say:
I genuinely recommend that everyone working in AI spend one afternoon reading these 243 lines. Not because you’ll use pure-Python GPT training at work (please, PLEASE don’t), but because your mental model of LLMs will upgrade from “it’s amazing but I don’t know why” to “I know what it’s doing, so I can use it better.”
It’s like how you don’t need to be a mechanic to drive a car — but if you’re a race car driver, you better know how the engine works. And in 2026, more and more engineers are essentially “racing” with LLMs (๑•̀ㅂ•́)و✧
Everything Else Is Just for Efficiency
Back to Karpathy’s line:
“This is the full algorithmic content of what is needed. Everything else is just for efficiency.”
PyTorch? Efficiency. GPUs? Efficiency. CUDA kernels? Efficiency. Flash Attention? Efficiency. Distributed training? Efficiency.
The core algorithm? 243 lines. Addition and multiplication.
Karpathy spent six years going from micrograd and minGPT through nanoGPT and llm.c to nanochat, peeling layers all the way down to these 243 lines. This isn't a genius having a flash of insight — it's an obsessive person executing a long-term plan.
And what he uncovered at the bottom tells us: the AI you thought was mysterious is basically middle school math. Just done a whole lot of times.
Original tweet: @karpathy (๑˃ᴗ˂)ﻭ