Karpathy Built an 8-Agent AI Research Team — They Can't Actually Do Research

Karpathy spent a weekend running 4 Claude + 4 Codex agents as an ML research team on GPUs. The result: agents are S-tier at implementation but F-tier at experiment design. His key insight — 'You are now programming an organization' — might define agentic engineering in 2026.

Karpathy's Honest Take: AI Agents Still Can't Optimize My Code (But I Haven't Given Up)

Opus 4.6 & Codex 5.3 sped up Karpathy's GPT-2 training by 3 mins. Karpathy failed similar attempts, noting AI's weak open-ended code optimization. Opus deletes comments, ignores CLAUDE.md, and errs. Yet, with oversight, models are useful.