swe-bench
2 articles
Epoch AI Re-Ran SWE-bench Verified: Better Scores May Mean Better Evaluation Setup, Not Just Better Models
Epoch AI's SWE-bench Verified v2.x aligns model scores with developer reports. Key lesson: benchmark outcomes are heavily influenced by scaffold/tooling quality, environment reliability, and evaluation settings, not just base model capability.
SWE-bench February Exam Results Are In — Opus 4.5 Beats 4.6, Chinese Models Take Half the Top 10, GPT-5.3 No-Shows
SWE-bench: Claude Opus 4.5 (76.8%) unexpectedly beat 4.6 (75.6%) for #1. MiniMax M2.5 tied for #2 at 1/20th Opus's price, with 4 Chinese models in top 10. GPT-5.3-Codex missed due to no API. Bonus: Claude for Chrome to add chart labels.