evals
2 articles
Eval-Driven Development — You Test Your Code, But Who Tests Your AI?
You use unit tests to check your code and CI to protect your pipeline. But who checks your AI? Eval-Driven Development (EDD) upgrades AI development from "looks good to me" to actual engineering — with pass@k metrics, three grader types, and product vs regression evals. This is TDD for the AI era.
Anthropic Exposes AI Benchmarks' Dirty Secret — Leaderboard Gaps Might Just Mean 'Bigger VM'
Anthropic found that agentic coding benchmark scores can swing by up to 6 percentage points based on hardware configuration alone, often more than the gap between top models on leaderboards. Next time someone claims a 2-3 point lead, ask them what VM they ran on.