research
7 articles
Do You Actually Know How to Use AI? Anthropic Tracked 10,000 Conversations to Find Out
Anthropic analyzed 9,830 Claude.ai conversations and defined 11 observable AI fluency behaviors. Key finding: people who iterate show 2x the fluency. But when AI produces beautiful artifacts, users question its reasoning less. The prettier the output, the more dangerous it gets.
Anthropic Analyzed Millions of Claude Code Sessions — Your Agent Can Handle Way More Than You Let It
Anthropic's study of millions of Claude Code agent sessions: autonomous run lengths have doubled (45+ min), experienced users auto-approve actions in 40%+ of sessions, and Claude asks clarifying questions more often than it gets interrupted. 73% of API actions still keep a human in the loop. Key takeaway: models can handle more autonomy than users grant them (the 'deployment overhang').
33,000 Agent PRs Tell a Brutal Story: Codex Dominates, Copilot Struggles, and Your Monorepo Might Not Survive
Drexel/Missouri S&T researchers analyzed 33,596 agent-authored GitHub PRs from 5 coding agents. Overall merge rate: 71%. Codex: 83%, Claude Code: 59%, Copilot: 43%. Top rejection pattern: PRs closed without ever being reviewed. LeadDev warns the flood of agent PRs is crushing monorepos and CI.
Anthropic Research: Will AI Fail as a 'Paperclip Maximizer' or a 'Hot Mess'?
Anthropic Fellows research finds AI becomes more incoherent the longer it reasons, suggesting failures will look more like industrial accidents than classic misalignment.
Peking University: AI Agents Follow Physics Laws?!
Physics researchers at Peking University found that LLM agents obey 'detailed balance', a law from thermodynamics. This isn't a bug, it's a feature.
MIT Research: Making LLMs Recursively Call Themselves to Handle 10M+ Tokens
When you stuff too much into a context window, models get dumber — that's context rot. MIT proposes Recursive Language Models (RLMs), letting LLMs recursively call themselves in a Python REPL to handle massive inputs. GPT-5-mini + RLM beats vanilla GPT-5 on hard tasks, and it's cheaper too.
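The core RLM idea can be sketched in a few lines: rather than feeding a huge input to one context window, the model recursively calls itself on halves of the input and merges the sub-answers. This is a minimal illustration, not MIT's implementation; `llm_call` is a hypothetical stub standing in for a real model API, and the 200-character "context limit" is an assumption for demonstration.

```python
# Hypothetical stand-in for a real LLM API call (the paper uses models
# like GPT-5-mini); here it just returns the prompt's first sentence.
def llm_call(prompt: str) -> str:
    return prompt.split(".")[0] + "."

CONTEXT_LIMIT = 200  # assumed toy limit: "model" can only see 200 chars


def rlm(query: str, text: str) -> str:
    """Answer `query` over `text`, recursing when the text is too big."""
    if len(text) <= CONTEXT_LIMIT:
        # Base case: the text fits, so answer with a single call.
        return llm_call(f"{query}\n\n{text}")
    mid = len(text) // 2
    left = rlm(query, text[:mid])    # recursive sub-call on first half
    right = rlm(query, text[mid:])   # recursive sub-call on second half
    # One more (small) call combines the two partial answers.
    return llm_call(f"{query}\n\n{left} {right}")
```

In the actual RLM setup the recursion happens inside a Python REPL the model controls, letting it decide how to split and query the input; the sketch above fixes a naive halving strategy just to show the recursive structure.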
How AI Assistance Affects Coding Skill Development: Latest Anthropic Research
Anthropic's research shows engineers using AI assistance scored 17% lower on tests than those who coded manually. The key difference? Whether they asked 'why' — high scorers used AI to check understanding, low scorers just copied and pasted.