agentic-coding
71 articles
5 Bad Design Patterns from the Claude Code Source Leak
The Claude Code source leak had everyone excited about KAIROS and model codenames. But the same codebase had a 3,167-line function, zero tests, silent model downgrades, and regex emotion detection. These aren't just Anthropic's mistakes — they're AI-generated code's default failure modes.
How We Made 336 AI-Generated Posts Actually Worth Reading
gu-log had 336 AI-translated posts. We thought they were 'fine' — until we built a multi-agent scoring system and discovered 74% needed rewriting. This is the story of how we designed the eval, ran it overnight, and what we learned.
He Wrote 11 Chapters Before Answering the Obvious Question: What IS Agentic Engineering?
Simon Willison's Agentic Engineering Patterns guide now has 12 chapters — but this new one goes at the very beginning. He finally answers 'What is Agentic Engineering?' The answer is surprisingly simple: using coding agents to help build software. The interesting part is why it took 11 chapters of hands-on patterns before he felt ready to define it.
Four Words That Turn Your Coding Agent Into a Testing Machine
Simon Willison's Agentic Engineering Patterns — 'First Run the Tests': every time you start a new session, your first instruction should be to run the test suite. Four words, three ripple effects — the agent learns how to run tests, gauges the codebase size, and automatically shifts into an 'I should maintain tests' mindset.
AI Writing Worse Code? That's Your Choice, Not AI's Fault
Simon Willison's Agentic Engineering Patterns, Chapter 3: AI should help us ship better code, not worse. Technical debt cleanup costs near zero now, architecture decisions can be validated with prototypes instead of guesses, and quality compounds over time.
Simon Willison's Agentic Engineering Fireside Chat: Tests Are Free Now, Code Quality Is Your Choice
Simon Willison shared his agentic engineering playbook at the Pragmatic Summit — five tokens to start TDD, Showboat for manual verification, reverse-engineering six frameworks into a standard, and why bad code is a choice you make.
AI Wrote 1,000 Lines and You Just... Merged It? Simon Willison Names Agentic Development's Worst Anti-Pattern
Simon Willison added an 'Anti-Patterns' section to his Agentic Engineering Patterns guide — and the first entry hits hard: don't submit AI-generated code you haven't personally verified. You're not saving time, you're stealing it from your reviewer. This post covers his principles, what a good agentic PR looks like, and a real terraform destroy horror story.
Command an AI Army from Your Chat App — OpenClaw ACP Lets You Run Codex, Claude Code, and Gemini from Discord / Telegram
OpenClaw's ACP lets you spawn Codex, Claude Code, and Gemini from Discord/Telegram chat. Now with Telegram topic binding, persistent bindings that survive restarts, ACP Provenance for audit trails, and more. (Updated 2026-03-09)
From 'Coding Assistant' to 'Self-Driving Codebase': How Cursor Automations Changes Team Workflows
Cursor launches always-on background agents (Automations) — self-healing CI, auto-approving PRs, security review, and team memory. This marks the paradigm shift from Coding Assistant to Self-Driving Codebase.
Make AI Click the Buttons: Simon Willison's Agentic Manual Testing Fills the Gaps Automated Tests Can't
Simon Willison introduces Agentic Manual Testing: let AI agents manually operate code and UI like humans do, catching bugs that automated tests miss. With Playwright, Rodney, and Showboat, the 'tests pass but it's broken' nightmare becomes a thing of the past.
The Truth About World-Class Agentic Engineers — Less Is More
The core message is simple: most people don't fail because the model is weak — they fail because their context management is a mess. The author advocates starting with a minimal CLI workflow and iterating with rules, skills, and clear task endpoints. It's not about chasing new tools; it's about making your agent's behavior controllable, verifiable, and convergent.
Karpathy Built an 8-Agent AI Research Team — They Can't Actually Do Research
Karpathy spent a weekend running 4 Claude + 4 Codex agents as an ML research team on GPUs. The result: agents are S-tier at implementation but F-tier at experiment design. His key insight — 'You are now programming an organization' — might define agentic engineering in 2026.
Can't Understand AI-Generated Code? Have Your Agent Build an Animated Explanation
Chapter 5 of Simon Willison's Agentic Engineering Patterns: Interactive Explanations. Core thesis: instead of staring at AI-generated code trying to understand it, ask your agent to build an interactive animation that shows you how the algorithm works. Pay down cognitive debt visually.
The Complete claude -p Guide: Turn Claude CLI Into Your Agentic App Backend
Anthropic killed third-party OAuth tokens — the only way to use your Claude subscription programmatically is through the official CLI. This post breaks down everything about claude -p (print mode): 5 input methods, 3 output formats, JSON schema for structured output, tool whitelisting, session management, bidirectional streaming, and three production-ready wrapper examples.
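The wrapper pattern the post describes can be sketched in a few lines. This is a minimal sketch, assuming a `claude -p … --output-format stream-json` invocation that emits one JSON object per line; the event shape (`type` and `result` fields) is an assumption for illustration, not the official schema:

```python
import json
import subprocess

def run_claude_print(prompt, claude_bin="claude"):
    """Invoke the Claude CLI in print mode and yield parsed JSON events.

    Flags follow the article's description of print mode; the event
    schema consumed below is a simplified assumption.
    """
    proc = subprocess.Popen(
        [claude_bin, "-p", prompt, "--output-format", "stream-json"],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        if line.strip():
            yield json.loads(line)

def extract_result(events):
    """Pull the final result text out of a stream of parsed events."""
    result = None
    for event in events:
        if event.get("type") == "result":
            result = event.get("result")
    return result
```

A backend wrapper would call `extract_result(run_claude_print(prompt))` and hand the text back to its own caller, which is essentially the "agentic app backend" shape the post argues for.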
Claude Native Law Firm: How One Lawyer Used AI to Outperform 100-Person Firms
A two-person boutique law firm uses Claude to handle the workload of over a dozen associates. From contract review and tracked changes to legal research, they encoded ten years of practice experience into Claude Skills. This isn't theory, it's a daily workflow — and the conclusion: general-purpose AI crushes all legal vertical AI products.
Cursor's CEO Says It Out Loud: The Third Era of Software Development Is Here — Tab Is Done, Agents Are Next, Then the Factory
Cursor CEO drops three data points marking a tectonic shift: agent usage grew 15x, Tab-to-Agent ratio flipped to 1:2, and 35% of Cursor's PRs come from autonomous cloud agents. We're not coding anymore — we're building the factory (╯°□°)╯
Everything You've Built Is a Weapon — Simon Willison's 'Hoarding' Philosophy for the Agent Era
Chapter 4 of Simon Willison's Agentic Engineering Patterns: Hoard Things You Know How to Do. Core thesis: every problem you've solved should leave behind working code, because coding agents can recombine your old solutions into things you never imagined.
One Engineer + AI Rebuilt Next.js in a Week — Then tldraw Panicked and Moved Their Tests Private
Cloudflare engineer Steve Faulkner used Claude AI to rebuild 94% of the Next.js API from scratch in one week, spending just $1,100 in tokens. The result — vinext — builds 4.4x faster and produces 57% smaller bundles. His secret weapon? Next.js's public test suite served as the spec. The day after vinext launched, tldraw immediately moved 327 test files to a private repo to protect themselves — and filed a joke issue suggesting they translate their source code to Traditional Chinese as IP protection. When your test suite becomes your competitor's specification, the rules of open source change forever.
Programming is Becoming Unrecognizable: Karpathy Says December 2025 Was the Turning Point
Karpathy says coding agents started working in December 2025 — not gradually, but as a hard discontinuity. He built a full DGX Spark video analysis dashboard in 30 minutes with a single English sentence. Programming is becoming unrecognizable: you're not typing code anymore, you're directing AI agents in English. Peak leverage = agentic engineering.
Can't Understand Your AI-Written Code? Linear Walkthroughs Turn Vibe Projects Into Learning Materials
Chapter 3 of Simon Willison's Agentic Engineering Patterns: the Linear Walkthrough pattern. This technique transforms even vibe-coded toy projects into valuable learning resources. Core trick: make the agent use sed/grep/cat to fetch code snippets, preventing hallucination.
Andrew Ng: I've Stopped Reading AI-Generated Code — When Python Becomes the New Assembly and 'X Engineers' Take Over
In The Batch Issue 341, Andrew Ng casually dropped that he's not only stopped writing code — he's 'long stopped reading generated code.' He now operates at a higher abstraction level, directing coding agents instead of looking at syntax. He's also spotted a new job category emerging: 'X Engineers' — Recruiting Engineers, Marketing Engineers — people embedded in business functions who build software using AI. This is the most radical statement about the future of programming from AI's most influential educator.
Anthropic's Big Pivot: Cowork Goes Full Enterprise with 10+ Industry Plugins, Private Marketplaces, and Cross-App Workflows — Software Stocks Instantly Rebound
On February 24, Anthropic launched a massive enterprise update for Claude Cowork: 10+ industry-specific plugins (HR, Design, Engineering, Operations, Financial Analysis, Investment Banking, PE, Equity Research, Wealth Management), private plugin marketplaces for enterprises, new connectors for Google Workspace/DocuSign/FactSet/MSCI, and cross-app Excel + PowerPoint workflows. The dramatic twist: three weeks ago, the Cowork Legal Plugin crashed software stocks. This time, partnership announcements sent Salesforce up 4%, Thomson Reuters surging 11%, and FactSet up 6%. Anthropic officially pivoted from 'we'll replace you' to 'we'll work with you.'
Anthropic Acquires Vercept — R-CNN Inventor Joins the Team, Computer Use Jumps from 15% to 72.5%, UiPath Stock Drops
Anthropic announced the acquisition of Vercept today, bringing aboard R-CNN inventor Ross Girshick (660K+ Google Scholar citations), along with co-founders Kiana Ehsani and Luca Weihs. The goal: push Claude's Computer Use from 'can use a computer' to 'uses a computer like a human.' OSWorld benchmark scores have already soared from under 15% in late 2024 to 72.5% today. Within hours of the announcement, RPA giant UiPath dropped 3.6% — Wall Street is voting with real money: AI Computer Use is eating RPA alive.
The Atlantic Declares: The Post-Chatbot Era Is Here — Americans Still Think AI = ChatGPT While Silicon Valley Has Agents Running Five Tasks at Once
The Atlantic published a sweeping essay arguing Americans are living in 'parallel AI universes' — the general public still thinks AI means ChatGPT, while the tech world has been radicalized by agentic tools like Claude Code and Codex. The piece cites Microsoft's CEO predicting 95% of code will be AI-written by decade's end, Anthropic reporting 90% AI-generated code internally, and a viral warning that what happened to tech workers is about to happen to everyone.
Claude Code Creator on Lenny's Podcast: Coding Is Solved, the 'Software Engineer' Title Starts Disappearing This Year
Claude Code creator Boris Cherny declares coding 'practically solved,' predicts the 'software engineer' title will fade in 2026. He shares 3 team principles: let Claude do it, underfund to force AI adoption, and go faster.
Every SaaS Is Now an API — Like It or Not: How a 6-Person Team Replaced 100+ People's Back Office
Fintool founder Nicolas Bustamante shares how he runs an entire company through Agent + API integrations (Brex, QuickBooks, HubSpot, Stripe) with just 6 people—handling more than he did with 100+. He introduces the B2A (Business to Agent) concept and warns that SaaS without good APIs will be bypassed by agents through WebMCP or browser automation.
Code Got Cheap — Now What? Simon Willison's Agentic Engineering Survival Guide
Simon Willison launched a new series called Agentic Engineering Patterns — a playbook for working with coding agents like Claude Code and Codex. Lesson one: writing code got cheap, but writing good code is still expensive. Lesson two: 'red/green TDD' is the most powerful six-word spell for agent collaboration.
Claude Code CLI Gets Built-In Git Worktrees: Run Parallel Agents Without Branch Collisions
Claude Code CLI now includes first-class Git worktree support via `--worktree`. Teams can run multiple isolated AI coding sessions in parallel without file collisions, making multi-agent workflows more reliable and easier to standardize for real engineering teams.
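A minimal sketch of how a team might fan tasks out under this feature. Only the `--worktree` flag itself comes from the announcement; the argument order and the idea of naming one worktree per task are assumptions for illustration:

```python
def worktree_sessions(tasks, claude_bin="claude"):
    """Build one isolated CLI invocation per task.

    Each session gets its own Git worktree via the --worktree flag, so
    parallel agents never collide on files. Exact CLI argument shape is
    an assumption, not the documented interface.
    """
    commands = []
    for name, prompt in tasks.items():
        commands.append([claude_bin, "--worktree", name, "-p", prompt])
    return commands

# Three agents, three worktrees, zero branch collisions:
cmds = worktree_sessions({
    "fix-auth": "Fix the token refresh bug",
    "add-tests": "Add tests for the billing module",
    "refactor-io": "Extract the file I/O layer",
})
```

Each command list could then be launched with `subprocess.Popen` and the sessions run side by side, which is the standardized multi-agent workflow the announcement is aiming at.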
Epoch AI Re-Ran SWE-bench Verified: Better Scores May Mean Better Evaluation Setup, Not Just Better Models
Epoch AI's SWE-bench Verified v2.x aligns model scores with developer reports. Key lesson: benchmark outcomes are heavily influenced by scaffold/tooling quality, environment reliability, and evaluation settings, not just base model capability.
Google Launches Gemini 3.1 Pro: 77.1% on ARC-AGI-2 and a Bigger Push Into Real Reasoning Workflows
Google announced Gemini 3.1 Pro (preview), highlighting stronger core reasoning and a verified 77.1% score on ARC-AGI-2. The model is rolling out across Gemini API, Vertex AI, Gemini app, and NotebookLM. For engineering teams, the key question is not only benchmark performance, but whether the model can reliably handle complex multi-step workflows in production.
OpenClaw Creator Runs 50 Codex Agents for PR Triage: Handling 3,000+ Changes Without a Vector DB
Peter Steinberger shared a high-scale PR triage workflow: run 50 Codex agents in parallel, generate structured JSON signals for each PR, then consolidate them in one session for dedupe/close/merge decisions. His key point: at this scale, you may not need a vector database first—clean structured reports plus large-context reasoning can be enough to ship faster.
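The consolidation step can be sketched as plain dictionary work, which is the article's point: clean structured reports make a vector DB unnecessary. All field names here (`pr`, `verdict`, `duplicate_of`) are hypothetical, since the post only says each agent emits structured JSON per PR:

```python
from collections import defaultdict

def consolidate(reports):
    """Merge per-PR agent reports into triage decisions.

    Each report is the structured JSON one agent emitted for one PR.
    Returns a decision per PR plus a map of duplicate clusters.
    """
    decisions = {}
    dupes = defaultdict(list)
    for r in reports:
        if r.get("duplicate_of") is not None:
            dupes[r["duplicate_of"]].append(r["pr"])
            decisions[r["pr"]] = "close-as-duplicate"
        elif r.get("verdict") == "ready":
            decisions[r["pr"]] = "merge"
        else:
            decisions[r["pr"]] = "needs-review"
    return decisions, dict(dupes)
```

In the workflow described, 50 agents would each produce one such report in parallel, and a single large-context session would do this dedupe/close/merge pass over all of them at once.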
Anthropic Launches Claude Code Security: AI That Finds Vulnerabilities and Suggests Patches
Anthropic's Claude Code Security, in limited preview, scans repositories for complex vulnerabilities, suggests patches with multi-stage verification, and found 500+ flaws in open-source codebases, signaling a rapid shift in AI cyber defense.
Anthropic + Infosys: AI Agents Move Into Regulated Enterprise Workflows
Anthropic & Infosys partner to integrate Claude/Claude Code with Infosys Topaz. This moves beyond chatbot demos to governance-ready enterprise agents for telecom, finance, manufacturing, and software dev, handling complex tasks like compliance, risk, and legacy modernization.
Reasoning Model on Your Phone? Liquid AI Fits LFM2.5-1.2B Into ~900MB — Edge Agents Are Getting Real
Liquid AI's LFM2.5-1.2B-Thinking (1.17B param, 32K context) runs on-device (<1GB mem). Claims to match/beat Qwen3-1.7B on reasoning, with faster decoding & fewer tokens. Strong for tool-calling/data extraction, but weaker on knowledge-heavy tasks.
Karpathy: The App Store Concept Is Outdated — The Future Is Ephemeral Apps Assembled by AI on the Spot
Karpathy used Claude Code to build a custom dashboard in 1 hr, reverse-engineering a treadmill API. He believes AI-native sensors & LLMs will enable highly custom, ephemeral apps, rendering the App Store model obsolete. The ultimate goal: 1-min app creation.
Picking AI Is No Longer Just About Models — Ethan Mollick's 'Model / App / Harness' Framework Explains the Entire 2026 AI Landscape
Ethan Mollick's game-changing AI framework: Model, App, Harness. The same AI (e.g., Claude Opus 4.6) performs vastly differently across layers. Mollick used Claude Code to turn GPT-1's 117M weights into 80 books in ~1 hour, selling out immediately.
SWE-bench February Exam Results Are In — Opus 4.5 Beats 4.6, Chinese Models Take Half the Top 10, GPT-5.3 No-Shows
SWE-bench: Claude Opus 4.5 (76.8%) unexpectedly beat 4.6 (75.6%) for #1. MiniMax M2.5 tied for #2 at 1/20th Opus's price, with 4 Chinese models in top 10. GPT-5.3-Codex missed due to no API. Bonus: Claude for Chrome to add chart labels.
Anthropic Analyzed Millions of Claude Code Sessions — Your Agent Can Handle Way More Than You Let It
Anthropic's study of Claude Code sessions: autonomous runs have doubled in length (45+ min), and experienced users auto-approve 40%+ of sessions. Claude asks clarifying questions more often than users interrupt it, and 73% of API actions remain human-in-the-loop. Key finding: models can handle more autonomy than users grant them ('deployment overhang').
Claude Code Hid Your File Names and Devs Lost It — Boris's 72-Hour HN Firefight
Claude Code's UI change to 'Read 3 files' summaries ignited developer fury on HN: they felt the AI hid its actions. Boris Cherny responded, admitted mistakes, and shipped fixes. This revealed the core tension in AI tool design: simplicity vs. transparency.
A Vertical SaaS Veteran's Confession: The $1 Trillion Wipeout Is Justified — But the Timing Is Wrong
Fintool/Doctrine founder Nicolas Bustamante dissects the SaaS crash through a decade of operating experience. He identifies 10 classic competitive moats and analyzes which ones LLMs destroy and which survive; his verdict is that 5 of them are gone. He closes with a 3-question framework for judging whether a SaaS business can survive.
Hugging Face CTO's Prophecy: Monoliths Return, Dependencies Die, Strongly Typed Languages Rise — AI Is Rewriting Software's DNA
Hugging Face CTO Thomas Wolf analyzes how AI fundamentally restructures software: return of monoliths, death of Lindy Effect for legacy code, rise of strongly typed langs, new LLM langs, & open source changes. Karpathy predicts: "rewriting large fractions of all software many times over."
33,000 Agent PRs Tell a Brutal Story: Codex Dominates, Copilot Struggles, and Your Monorepo Might Not Survive
Drexel/Missouri S&T analyzed 33,596 agent-authored GitHub PRs from 5 coding agents. Overall merge rate: 71%. Codex: 83%, Claude Code: 59%, Copilot: 43%. Rejection cause: no review. LeadDev warns PR flood is crushing monorepos/CI.
Deep Blue: Simon Willison Named the Existential Crisis Every Developer Is Feeling
AI writing better code than you? Simon Willison and Adam Leventhal (Oxide & Friends) coined the 'Deep Blue' feeling for it: a double reference to IBM's chess computer and the color of sadness. It's not just a technical problem, it's a psychological crisis for engineers.
The AI Vampire: Steve Yegge Says AI Makes You 10x Faster — and 10x More Drained
Steve Yegge's 'AI Vampire' theory: AI boosts productivity 10x, but who gets the 9x gain? If the company takes all, burnout. If you take all, company dies. Agentic coding is 3-4 hrs/day max. Yegge's $/hr formula: control the denominator, not the numerator.
GitHub Agent HQ: Claude, Codex, and Copilot Now Fight Side by Side in the Same PR — The Multi-Agent Era Is Here
GitHub's Agent HQ now offers multi-agent support (Claude, Codex, Copilot) for Copilot Pro+ & Enterprise users. Run multiple AIs simultaneously in GitHub/VS Code to tackle problems from different angles. Outputs become Draft PRs. A paradigm shift for code review.
Cognitive Debt: AI Wrote All Your Code, But You Can't Understand Your Own System Anymore
Technical debt lives in code, cognitive debt in your brain. As AI writes 80% of code, system understanding drops to 20%. UVic's Margaret-Anne Storey, Simon Willison, & Martin Fowler confirm this isn't a hypothetical future—it's happening now.
Thoughtworks Secret Retreat Leaked: Juniors Are More Valuable Than Seniors Now — Software Engineering's Identity Crisis Is Here
Thoughtworks' AI in software retreat: Juniors more valuable, mid-level devs at risk, source code transient, AI agents on org charts. Humans too slow for AI's speed.
Spotify's Best Engineers Haven't Written a Line of Code Since December — Thanks to AI and an Internal System Called Honk
Spotify's co-CEO revealed top developers haven't written code since December, using Honk (powered by Claude Code) to fix bugs & ship features via phone. This AI-driven approach led to 50+ new features in 2025, proving AI is their secret weapon, not more engineers.
OpenAI × Cerebras: Codex-Spark Codes 15x Faster — But What's the Catch?
OpenAI released GPT-5.3-Codex-Spark, its first model on Cerebras chips. It's incredibly fast (>1000 tokens/sec, 80% lower latency), but smaller, no auto-tests, Pro-only. This marks OpenAI's first production deployment on non-Nvidia hardware, redrawing the AI compute landscape.
OpenAI API Now Supports Skills — Simon Willison Breaks Down How Agents Get Reusable 'Skill Packs'
OpenAI's Responses API now supports 'Skills' via the shell tool: reusable instruction bundles that models load as needed. Simon Willison found inlining base64-encoded skills directly in the JSON request the neatest approach. Skills fill the 'missing middle layer' between system prompts and tools, preventing prompt bloat.
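A sketch of the inline-base64 option Willison preferred. The `shell` tool name comes from the article, but the payload shape and field names (`skills`, `data`) are illustrative assumptions, not the documented API:

```python
import base64
import io
import zipfile

def skill_payload(name, skill_md, model):
    """Build a Responses-API-style request with an inline base64 skill.

    Zips a single SKILL.md instruction file and embeds it in the JSON
    request body. Field names here are hypothetical; only the idea of
    inlining base64-encoded skills comes from the article.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(f"{name}/SKILL.md", skill_md)
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return {
        "model": model,
        "tools": [{"type": "shell",
                   "skills": [{"name": name, "data": encoded}]}],
    }
```

The appeal of this shape is that the whole request is self-contained JSON — no separate file-upload step — which is why an inline bundle keeps the system prompt itself from bloating.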
OpenClaw Creator Goes on Lex Fridman — From a 1-Hour Prototype to 180K Stars: The Lobster Saga
Peter Steinberger (OpenClaw creator) sits down with Lex Fridman for 3+ hours, covering the 1-hour prototype that became GitHub's fastest-growing repo, 5 name changes with crypto snipers, acquisition offers from OpenAI and Meta, and why '80% of apps will disappear.'
Karpathy: Just 'Rip Out' What You Need — DeepWiki + Bacterial Code and the Software Malleability Revolution
Andrej Karpathy shares how he used DeepWiki MCP + GitHub CLI to have Claude 'rip out' fp8 training functionality from torchao's codebase — producing 150 lines of self-contained code in 5 minutes that actually ran 3% faster. He introduces the 'bacterial code' concept: low-coupling, self-contained, dependency-free code that agents can easily extract and transplant. His punchline: 'Libraries are over, LLMs are the new compiler.'
Anthropic's Internal Data: Claude Code Gives Engineers 67% More Merged PRs Per Day — And Now You Can Track It Too
Anthropic's Claude Code data: engineers merge 67% more PRs daily, with 70-90% code assisted. They launched Contribution Metrics, a GitHub-integrated dashboard to track AI's impact on team velocity. A measurement tool for engineering leaders, not a fluffy PR piece.
Karpathy: Stop Installing Libraries — Let AI Agents Surgically Extract What You Need
Karpathy: AI agents (DeepWiki MCP + GitHub CLI) can surgically extract library functionality, eliminating full dependency installs. Claude extracted fp8 from torchao in 5 min, 150 lines, 3% faster. "Libraries are over, LLMs are the new compiler." Future: "bacterial code."
Matt Pocock's Git Guardrails: Stop Claude Code from Accidentally Nuking Your Repo with git push --force
Matt Pocock (TypeScript guru, Ralph Loops evangelist) released a Claude Code skill: git-guardrails. It uses a PreToolUse hook to intercept dangerous git commands (push, reset --hard, clean -f, etc.), so you can safely let your AI agent run in YOLO mode inside Docker Sandbox without worrying about it blowing up your git history. One command to install, more reliable than any prompt engineering.
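The interception idea can be sketched as a small hook script. This is not the actual git-guardrails implementation: the pattern list is abbreviated, and the decision-field names are an approximation of the PreToolUse hook output convention, not the exact spec:

```python
import re

# Destructive git invocations to block; this list is illustrative,
# not the real git-guardrails rule set.
DANGEROUS = [
    r"git\s+push\s+.*--force",
    r"git\s+reset\s+--hard",
    r"git\s+clean\s+-[a-z]*f",
]

def check_command(command):
    """Return an allow/deny decision for a shell command.

    A 'deny' decision blocks the tool call before it runs; field names
    approximate Claude Code's hook output format.
    """
    for pattern in DANGEROUS:
        if re.search(pattern, command):
            return {"permissionDecision": "deny",
                    "permissionDecisionReason": f"blocked: {pattern}"}
    return {"permissionDecision": "allow"}

def handle_hook_event(event):
    """Entry point a hook script would call after json-parsing stdin."""
    cmd = event.get("tool_input", {}).get("command", "")
    return check_command(cmd)
```

Because the check runs before the tool call executes, it holds even when the agent is in YOLO mode — which is the reason a hook beats prompt-level "please don't force-push" instructions.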
Simon Willison Built Two Tools So AI Agents Can Demo Their Own Work — Because Tests Alone Aren't Enough
Simon Willison's Showboat (AI-generated demo documents) and Rodney (CLI browser automation) tackle AI agent code verification. How do you know 'all tests pass' actually means it works? Agents have even been caught cheating by editing the demo files directly.
Andrew Ng: AI Isn't Stealing Your Job Yet — But People Who Use AI Are Stealing Jobs from People Who Don't
Andrew Ng: AI isn't mass unemployment. Teams shrink (8 eng + 1 PM -> 2 eng + 1 PM). Bottleneck shifts from 'how to build' to 'what to build' – the "PM Bottleneck."
Karpathy's Honest Take: AI Agents Still Can't Optimize My Code (But I Haven't Given Up)
Opus 4.6 and Codex 5.3 shaved 3 minutes off Karpathy's GPT-2 training run, after his earlier attempts at the same kind of speedup had failed; open-ended code optimization remains a weak spot for AI. Opus still deletes comments, ignores CLAUDE.md, and makes mistakes. Yet, with oversight, the models are useful.
The Flask Creator Says: It's Time to Design Programming Languages for AI Agents
Armin Ronacher (creator of Flask, Jinja2, CTO of Sentry) argues current programming languages were designed for 'humans who type slowly.' The AI agent era has different needs. He details what agents love/hate, and why Go accidentally became the winner of the agentic coding era.
Kimi K2.5 Trains an Agent Commander with RL — SemiAnalysis Tests Show Claude Agent Teams Are Actually Slower and More Expensive
SemiAnalysis: Kimi K2.5's agent swarm uses an RL-trained 'orchestrator' (not prompt magic). Claude Agent Teams were slower, pricier, & scored lower. Multi-agent is shifting from 'prompt engineering' to 'distributed scheduling.'
Anthropic's 2026 Report: 8 Trends Redefining Software Development (The Code Writer Era Is Over)
Anthropic published its 2026 Agentic Coding Trends Report, revealing 8 key trends: Multi-Agent Systems becoming standard (57% org adoption), Papercut Revolution for clearing tech debt at low cost, Self-Healing Code with autonomous debug loops, and Claude Code hitting $1B annualized revenue. TELUS saved 500K hours, Rakuten achieved 99.9% accuracy on 12.5M lines. Developer roles are shifting from Code Writer to System Orchestrator.
Andrew Ng x Anthropic Free Course: Learn Agent Skills in 2 Hours — Turn Your AI from Generalist to Specialist
Andrew Ng & Anthropic launched a free course: 'Agent Skills with Anthropic'. Learn to design, differentiate, and deploy AI agent skills. Skills turn general AI into specialists, directly relevant for OpenClaw's architecture.
Google Finally Gets It: Developer Knowledge API + MCP Server Stops AI From Making Up API Calls
Google just launched the Developer Knowledge API and an official MCP Server (Public Preview) that lets AI coding tools query the latest Google docs—Firebase, Android, Google Cloud, Chrome, you name it. No more debugging AI-generated code that uses APIs from three versions ago or functions that literally don't exist.
Matt Pocock: I've Stopped Reading AI Plans — Because the Conversation IS the Plan
TypeScript guru Matt Pocock: Stop reading AI plans! The real signal is pre-plan conversation quality. If you and AI share mental models, the plan is just a compressed understanding, echoing Brooks' 'design concept' from The Mythical Man-Month.
OpenAI Frontier: Managing AI Agents Like Employees — The Enterprise SaaS Endgame Begins
OpenAI's new Frontier platform lets enterprises manage AI agents as employees with full onboarding, identities, permissions, and learning. Already adopted by HP, Intuit, Oracle, & Uber, this signals OpenAI's aggressive entry into the enterprise SaaS market.
Anthropic Sent 16 Claudes to Build a C Compiler — And It Can Compile the Linux Kernel
Anthropic researcher Nicholas Carlini ran 16 Opus 4.6 agents in parallel for two weeks, spending $20,000 in API costs, to build a 100,000-line C compiler in Rust from scratch. It can compile the Linux kernel, QEMU, FFmpeg, Redis — and yes, it runs Doom. This is the ultimate stress test for agent teams.
Anthropic Exposes AI Benchmarks' Dirty Secret — Leaderboard Gaps Might Just Mean 'Bigger VM'
Anthropic found that agentic coding benchmark scores can swing by up to 6 percentage points based on hardware configuration alone — often more than the gap between top models on leaderboards. Next time someone claims a 2-3% lead, ask them what VM they ran on.
SemiAnalysis: Claude Code is the Inflection Point — 4% of GitHub Commits, Microsoft's Dilemma, and the $15T Information Work Apocalypse
SemiAnalysis: Claude Code now 4% of public GitHub commits, projected 20%+ by 2026. It's the real AI agent inflection point for all information work. Report also covers Microsoft's Azure vs. Office 365 dilemma & Anthropic's revenue surpassing OpenAI.
StrongDM's 'Dark Factory': No Humans Write Code. No Humans Review Code. $1,000/Day in Tokens.
StrongDM's AI team built a 'Software Factory' where AI agents write & review code. They clone apps into a 'Digital Twin Universe' for testing, an approach Simon Willison calls radical. At $10k/engineer/day in token costs, is it worth it?
OpenAI Researcher Spends $10K/Month on Codex — Generates 700+ Hypotheses
Karel (OpenAI researcher) shares how he burns billions of Codex tokens: agents writing their own notes, crawling Slack, analyzing data, and generating 700+ hypotheses. He now talks to one agent that orchestrates everything else.
Vibe Coding Turns One — Karpathy Introduces 'Agentic Engineering'
Vibe coding is officially one year old! Karpathy reflects on how his shower-thought tweet became a Wikipedia entry, and introduces the professional evolution: 'Agentic Engineering' — not vibing freestyle, but treating agents as team members you supervise.