evaluation
3 articles
Auto-Harness — The Open-Source Framework That Lets AI Agents Debug Themselves
NeoSigma open-sourced auto-harness, a self-improving loop that lets AI agents mine their own failures, generate evals from them, and fix themselves. On the Tau3 benchmark, the same model went from 0.56 to 0.78 with harness tweaks alone.
What Is Your Agent Actually Doing in Production? Traces Are Where the Improvement Loop Begins
LangChain's conceptual guide frames agent improvement as a trace-centric loop: collect traces, enrich them with evals and human annotations, diagnose failure patterns, fix based on observed behavior, validate with offline evals, then deploy, with each cycle starting from higher ground.
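A minimal sketch of one such cycle, assuming a hypothetical `Trace` record plus user-supplied `fix` and `offline_eval` callbacks; these names are illustrative and do not come from LangChain's guide or API:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    task: str      # what the agent was asked to do
    output: str    # what it produced
    passed: bool   # verdict from an eval or human annotation

def diagnose(traces):
    """Group failing traces by task to surface failure patterns."""
    patterns = {}
    for t in traces:
        if not t.passed:
            patterns.setdefault(t.task, []).append(t)
    return patterns

def improvement_cycle(traces, fix, offline_eval):
    """One iteration: diagnose failures, apply a fix, validate offline before deploy."""
    patterns = diagnose(traces)
    candidate = fix(patterns)  # e.g. a revised prompt or tool configuration
    return candidate if offline_eval(candidate) else None
```

Each call starts from the enriched traces of the previous deployment, which is what makes successive cycles begin "from higher ground".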
Epoch AI Re-Ran SWE-bench Verified: Better Scores May Mean Better Evaluation Setup, Not Just Better Models
Epoch AI's SWE-bench Verified v2.x aligns model scores with developer reports. Key lesson: benchmark outcomes are heavily influenced by scaffold/tooling quality, environment reliability, and evaluation settings, not just base model capability.