evals - Tags - gu-log

The Hard Part of Agents Is Not the Model. It Is the Engineering Floor.

SP-201 2026-05-15 · @HiTw93 on X

A practical agent engineering guide covering control loops, harnesses, context engineering, tool design, memory, multi-agent systems, evals, tracing, and safety boundaries.

Skillify: Turn Every Agent Failure Into Something Structurally Impossible to Repeat — Garry Tan's 10-Step Checklist

SP-179 2026-04-22 · @garrytan on X

Garry Tan's agent screwed up twice this week — both bugs had the same shape: deterministic work done in latent space. His fix is skillify: every failure becomes a SKILL.md + deterministic script + tests + evals + resolver trigger. Ten steps. The bug becomes structurally impossible to repeat.

agent-engineering skills claude-code openclaw

Eval-Driven Development — You Test Your Code, But Who Tests Your AI?

SP-151 2026-04-02 · @affaanmustafa on GitHub

You use unit tests to check your code and CI to protect your pipeline. But who checks your AI? Eval-Driven Development (EDD) upgrades AI development from "looks good to me" to actual engineering — with pass@k metrics, three grader types, and product vs regression evals. This is TDD for the AI era.

shroom-picks ai-agents claude-code testing
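
The summary above mentions pass@k without saying how it is computed. As a minimal sketch, the block below uses the unbiased estimator popularized by code-generation benchmarks: given `n` recorded attempts of which `c` passed, it estimates the probability that at least one of `k` sampled attempts passes. Whether the linked post uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes) from c passes in n trials.

    Assumption: this is the common unbiased estimator 1 - C(n-c, k) / C(n, k),
    not necessarily the formula used in the post summarized above.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so any k-subset contains at least one pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 passes out of 10 recorded attempts of one eval task.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

With `k = 1` the estimate reduces to the plain pass rate `c / n`, which is a quick sanity check on recorded results.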

Anthropic Exposes AI Benchmarks' Dirty Secret — Leaderboard Gaps Might Just Mean 'Bigger VM'

CP-39 2026-02-07 · Anthropic Engineering Blog (Gian Segato)

Anthropic found that agentic coding benchmark scores can swing by up to 6 percentage points based on hardware configuration alone, which is often more than the gap between top models on leaderboards. Next time someone claims a 2-3 point lead, ask them what VM they ran on.

benchmarks agentic-coding claude-code
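
The point of the summary above can be stated as a one-line check: a reported lead only means something if it is larger than the swing that infrastructure alone can cause. The sketch below is illustrative; the function name `gap_exceeds_noise` is hypothetical, and the default of 6 percentage points is taken from the figure quoted in the summary, not from a measurement of any particular benchmark.

```python
def gap_exceeds_noise(score_a: float, score_b: float, noise_pp: float = 6.0) -> bool:
    """Return True if two benchmark scores differ by more than the noise band.

    Scores and noise_pp are in percentage points. The 6.0 default is an
    assumption borrowed from the summary above; substitute a band measured
    on the benchmark and hardware actually in use.
    """
    if noise_pp < 0:
        raise ValueError("noise_pp must be non-negative")
    return abs(score_a - score_b) > noise_pp


if __name__ == "__main__":
    # A 2.5-point lead sits inside a 6-point hardware swing.
    print(gap_exceeds_noise(72.0, 69.5))
    # A 10-point lead does not.
    print(gap_exceeds_noise(80.0, 70.0))
```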