DeepSeek-R1 Grew Its Own Internal Debate Club — Nobody Asked It To

DeepSeek-R1 developed internal multi-agent debates through pure RL training, without anyone teaching it to. Google researchers call this the 'Society of Thought.' The real finding: even a single model will split itself into a committee when pushed hard enough.

AI Doesn't Need to Memorize the Times Table Anymore: How Reasoning and Tool Calling Let Small Models Punch Above Their Weight

Apple MLX creator Awni Hannun makes a counterintuitive point: intelligence-per-watt is skyrocketing partly because models no longer need to memorize answers they can compute. Reasoning and tool calling free up capacity in the weights, meaning 5B-15B models might eventually match today's GPT-5.x, though nobody really knows the ceiling yet.

Typewriter vs Editor: Mercury 2 Reinvents LLMs with Diffusion — 5x Faster Reasoning, 4x Cheaper

Inception Labs launches Mercury 2, the world's first reasoning diffusion LLM. Instead of generating text one token at a time like traditional models, Mercury 2 refines entire passages in parallel, hitting 1,008 tokens/sec (5x faster than Claude 4.5 Haiku) at 1/4 the price. Backed by Andrew Ng, Andrej Karpathy, and Eric Schmidt.

Google Launches Gemini 3.1 Pro: 77.1% on ARC-AGI-2 and a Bigger Push Into Real Reasoning Workflows

Google announced Gemini 3.1 Pro (preview), highlighting stronger core reasoning and a verified 77.1% score on ARC-AGI-2. The model is rolling out across the Gemini API, Vertex AI, the Gemini app, and NotebookLM. For engineering teams, the key question is not only benchmark performance but also whether the model can reliably handle complex multi-step workflows in production.