evaluation
3 articles
Auto-Harness — The Open-Source Framework That Lets AI Agents Debug Themselves
NeoSigma open-sourced auto-harness, a self-improving loop that lets AI agents mine their own failures, generate evals from them, and fix themselves. On the Tau3 benchmark, the same model went from 0.56 to 0.78 with harness tweaks alone.
What Is Your Agent Actually Doing in Production? Traces Are Where the Improvement Loop Begins
LangChain's conceptual guide frames agent improvement as a trace-centric loop: collect traces, enrich them with evals and human annotations, diagnose failure patterns, fix based on observed behavior, validate with offline evals, then deploy, with each cycle starting from higher ground.
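A minimal sketch of one such cycle, assuming a hypothetical `Trace` record plus user-supplied `fix` and `offline_eval` callbacks; these names are illustrative and do not come from LangChain's guide or API:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    task: str      # what the agent was asked to do
    output: str    # what it produced
    passed: bool   # verdict from an eval or human annotation

def diagnose(traces):
    """Group failing traces by task to surface failure patterns."""
    patterns = {}
    for t in traces:
        if not t.passed:
            patterns.setdefault(t.task, []).append(t)
    return patterns

def improvement_cycle(traces, fix, offline_eval):
    """One iteration: diagnose failures, apply a fix, validate offline before deploy."""
    patterns = diagnose(traces)
    candidate = fix(patterns)  # e.g. a revised prompt or tool configuration
    return candidate if offline_eval(candidate) else None
```

Each call starts from the enriched traces of the previous deployment, which is what makes successive cycles begin "from higher ground".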
Epoch AI Re-Ran SWE-bench Verified: Better Scores May Mean Better Evaluation Setup, Not Just Better Models
Epoch AI's SWE-bench Verified v2.x aligns model scores with developer reports. Key lesson: benchmark outcomes are heavily influenced by scaffold/tooling quality, environment reliability, and evaluation settings, not just base model capability.