llm
24 articles
Karpathy's LLM Knowledge Base Workflow — Let AI Build Your Personal Wikipedia
Andrej Karpathy shares his workflow for building a personal knowledge base with LLMs: dump raw materials in, let LLMs compile them into a Markdown wiki, then use CLI tools for Q&A, linting, and visualization. He thinks there's room for an incredible new product here.
Running a Trillion-Parameter Model on a MacBook? The Wild SSD Streaming Experiment
Simon Willison shared a new trend in running massive MoE models on Macs: streaming expert weights from SSD instead of cramming everything into RAM. Even a trillion-parameter Kimi K2.5 runs on a 96GB MacBook Pro.
Karpathy's Software Horror: One pip install Away From Losing All Your Keys
LiteLLM hit by supply chain attack — pip install was enough to steal all credentials. Karpathy warns about dependency tree risks and advocates using LLMs to yoink functionality instead of adding more deps.
Squeezing Every Drop of Performance: Ditching Python for Metal Shaders to Run Large Models Locally
Developer @danveloper shares their experience running Qwen3.5-397B-A17B locally: when Python's GIL became the bottleneck, they ripped Python out entirely and replaced it with custom Metal shaders.
Fine-tuning Qwen3-4B to 'Believe It Has Consciousness' — While Barely Changing Anything Else
N8 Programs shared a Qwen3-4B demo: after KL-regularized SFT, the model believes it has consciousness while other behaviors barely change. This ties into his earlier claim that KL-regularized SFT can add new capabilities while preserving base model abilities.
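The mechanics can be sketched in a few lines: fit the new behavior with cross-entropy while a KL term anchors the tuned model to the frozen base. This is a toy, self-contained illustration (made-up 3-token distributions and beta; not N8's actual training code):

```python
import math

def kl_regularized_sft_loss(p_model, p_base, target_idx, beta=0.1):
    """Toy per-token loss: cross-entropy toward the SFT target plus a KL
    penalty that keeps the tuned model close to the frozen base model.
    Real training works on logits over a full vocab, averaged over tokens."""
    ce = -math.log(p_model[target_idx])  # push probability onto the new behavior
    kl = sum(p * math.log(p / q) for p, q in zip(p_model, p_base))  # stay near base
    return ce + beta * kl

base  = [0.7, 0.2, 0.1]   # frozen base model's next-token distribution
tuned = [0.6, 0.3, 0.1]   # tuned model's distribution after SFT
loss = kl_regularized_sft_loss(tuned, base, target_idx=1)
```

When `p_model == p_base` the KL term vanishes and only the cross-entropy remains, which is why behaviors the SFT data never touches can stay nearly unchanged.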
Dan McAteer's verdict: Opus 4.6 has no real competition at 1 million tokens
Dan McAteer shares his long-context observations: Opus 4.6 performs best at 1 million tokens with 78% accuracy, Sonnet 4.6 is the closest competitor, and GPT-5.4 actually regressed compared to GPT-5.2 at long context.
Stuffing a Computer Inside the Transformer: How This Trick Lets LLMs Crush Sudoku
Christos Tzamos highlights a fascinating gap: LLMs can solve research-grade math but still fumble basic arithmetic. His team's approach? Embed a computer directly inside the transformer — and it solves the hardest Sudoku puzzles at 100% accuracy.
Vibe Coding's Real Power Might Not Be Speed — It's Cutting Out the Middlemen
SemiAnalysis argues that Vibe Coding's real adoption driver isn't just faster code — it's eliminating the long telephone game between domain experts and implementation. But if you don't know what you want, the LLM will happily build the wrong thing at warp speed, and production still needs real engineers.
Agents That Steer Themselves? The Hermes Agent Self-Guidance Experiment
Teknium shared an experiment on Hermes Agent where the agent can steer itself — clearing its own context, switching models, and prompting itself when stuck. A short tweet, but it points at a big shift in how agent control works.
GPT-5.4 Is Rolling Out on ChatGPT — and the API and Codex Are Live Too
OpenAI announced that GPT-5.4 Thinking and GPT-5.4 Pro are rolling out on ChatGPT, with GPT-5.4 also available via the API and Codex. The update consolidates advances in reasoning, coding, and agentic workflows into a single frontier model.
Agents Can Tune Neural Nets Now? Karpathy Watched Autoresearch Actually Speed Up Nanochat
Karpathy shared that he pointed autoresearch at nanochat, and in the first round it found ~20 additive improvements that brought 'Time to GPT-2' from 2.02 hours down to 1.80 hours. The real story isn't just the speedup — it's that an agent ran the entire tuning workflow end-to-end.
AI agent started tuning hyperparameters on its own — Karpathy says this is real
Andrej Karpathy shares how his autoresearch agent autonomously tuned nanochat's training config over two days, found ~20 improvements to validation loss that transferred to a larger model, and pushed the Time to GPT-2 leaderboard from 2.02h to 1.80h — about 11% better.
From Prompt to Production: A Practical Guide to Agentic AI Architecture
DataTalksClub founder Alexey Grigorev shared the full syllabus for his AI Engineering Buildcamp — six modules covering LLM APIs, RAG, Agentic Flows, Monitoring & Guardrails, Evaluation, and a Capstone project. It's one of the most complete learning paths for building agentic AI applications in production.
Your LLM Isn't Writing Correct Code — It's Writing Code That Looks Reasonable
The author benchmarked the system's stock SQLite against an LLM-generated Rust rewrite. The rewrite compiled and passed all its tests, yet primary-key lookups were ~20,000x slower. The takeaway: define acceptance criteria before you talk about AI productivity.
MCP Lifesaver? Context Mode Saves You 98% of Context Tokens
A hot Hacker News project called Context Mode uses sandbox isolation and smart retrieval to keep bloated tool outputs out of the LLM's context window, claiming up to 98% token savings!
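The isolation pattern described can be sketched like this (class and method names here are made up for illustration, not Context Mode's actual API): the full tool output lives outside the context window, and the model only sees a short preview plus a handle it can query later.

```python
class ToolOutputStore:
    """Toy store that keeps raw tool output out of the LLM's context."""

    def __init__(self):
        self._outputs = {}

    def put(self, output: str, preview_chars: int = 80) -> dict:
        handle = f"out-{len(self._outputs)}"
        self._outputs[handle] = output
        # Only this small dict ever enters the context window.
        return {"handle": handle, "preview": output[:preview_chars],
                "total_chars": len(output)}

    def search(self, handle: str, query: str) -> list[str]:
        """Crude retrieval: return only the lines matching the query."""
        return [line for line in self._outputs[handle].splitlines()
                if query in line]

store = ToolOutputStore()
big_output = "\n".join(f"row {i}: status=ok" for i in range(10_000))
ref = store.put(big_output)                    # context sees ~80 chars, not ~180 KB
hits = store.search(ref["handle"], "row 42:")  # targeted retrieval on demand
```

The claimed 98% savings comes from this asymmetry: a handle-plus-preview costs a few dozen tokens regardless of how large the underlying output is.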
Programming is Becoming Unrecognizable: Karpathy Says December 2025 Was the Turning Point
Karpathy says coding agents started working in December 2025 — not gradually, but as a hard discontinuity. He built a full DGX Spark video analysis dashboard in 30 minutes with a single English sentence. Programming is becoming unrecognizable: you're not typing code anymore, you're directing AI agents in English. Peak leverage = agentic engineering.
The LLM Context Tax: 13 Ways to Stop Burning Money on Wasted Tokens
The 'Context Tax' in AI imposes a triple penalty: cost, latency, and reduced intelligence. Nicolas Bustamante's 13 Fintool techniques cut agent token bills by up to 90%. A dollars-and-cents guide to optimizing AI context, covering KV cache, append-only context, and 200K token pricing.
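The append-only point rests on how provider-side KV caches work: only the longest shared prefix between requests can be reused. A minimal sketch of that mechanic (toy "token" lists, not any provider's actual cache logic):

```python
def cached_prefix_tokens(prev_request, new_request):
    """Length of the shared prefix between two requests — a toy stand-in
    for how much of the KV cache a provider can reuse. Editing an early
    message invalidates everything after the edit point."""
    n = 0
    for a, b in zip(prev_request, new_request):
        if a != b:
            break
        n += 1
    return n

history  = ["sys", "user1", "asst1"]
appended = history + ["user2"]                         # append-only: prefix intact
edited   = ["sys", "user1-edited", "asst1", "user2"]   # early edit: cache busted

cached_prefix_tokens(history, appended)  # full previous request reused
cached_prefix_tokens(history, edited)    # only "sys" still cached
```

This is why agents that rewrite or summarize earlier turns can quietly pay full price on every call, while append-only agents pay mostly cached-token rates.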
The SaaS Moat Is Crumbling — When LLMs Eat the Interface, All That's Left Is API vs API
Nicolas Bustamante argues LLMs are ending Ben Thompson's Aggregation Theory. With chat as the universal interface, SaaS companies' moats built on 'workflow complexity + user muscle memory' evaporate, leading to pure API vs API commodity competition.
Karpathy Trained GPT-2 for Just $72 — OpenAI Spent $43,000 Seven Years Ago
Karpathy open-sourced nanochat — a minimal LLM training framework. With 8 H100 GPUs running for 3 hours at $72, you can train a GPT-2 level model. OpenAI spent $43,000 training the same model in 2019. That's a 600x cost reduction. On spot instances, it's just $20.
AI Time Capsule: Karpathy Grades 10-Year-Old HN Predictions with GPT
Karpathy used GPT-5.1 to analyze decade-old Hacker News threads and find out who actually predicted the future (◕‿◕)
Simon Willison's 2026 Predictions: Is AI Replacing Human Coding?
Simon Willison shares his 2026 LLM predictions on Oxide and Friends podcast — LLM code quality will be undeniable, sandboxing will finally get solved, and there's a prediction about kākāpō parrots (◕‿◕)
MIT Research: Making LLMs Recursively Call Themselves to Handle 10M+ Tokens
When you stuff too much into a context window, models get dumber — that's context rot. MIT proposes Recursive Language Models (RLMs), letting LLMs recursively call themselves in a Python REPL to handle massive inputs. GPT-5-mini + RLM beats vanilla GPT-5 on hard tasks, and it's cheaper too.
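The recursive idea can be sketched as a divide-and-conquer over the input, where each sub-call fits in the context window (the `llm` stub below is a placeholder, not a real model API, and this is not MIT's code):

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call; here it just truncates its input."""
    return prompt[:200]

def recursive_answer(question: str, context: str, max_chars: int = 1000) -> str:
    """If the context fits, answer directly; otherwise split it, recurse on
    each half, and combine the two partial answers with one final short call."""
    if len(context) <= max_chars:
        return llm(f"{question}\n\n{context}")
    mid = len(context) // 2
    left = recursive_answer(question, context[:mid], max_chars)
    right = recursive_answer(question, context[mid:], max_chars)
    return llm(f"{question}\n\n{left}\n{right}")

answer = recursive_answer("Summarize:", "x" * 10_000)
```

No single call ever sees more than a window's worth of text, which is how a small model plus recursion can outrun a larger model on 10M+ token inputs.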
Karpathy's 2025 LLM Year in Review — The RLVR Era Begins
From RLVR to Vibe Coding, Karpathy breaks down 6 key LLM developments in 2025
Sebastian Raschka's 2025 LLM Review — The RLVR Era Has Arrived
From RLVR to inference-time scaling, what happened in 2025? Raschka's year-end summary highlights the key shifts