sre
3 articles
This Guy Deployed a Second AI Just to Fix His Broken AI
Upgrading OpenClaw keeps breaking your agent fleet? This developer's solution: spin up a separate Gateway as a 'family doctor' that does nothing but fix the main Gateway's agents. It has been running through multiple upgrades and is rock solid.
OpenClaw Health Suite (Part 1): From a 36-Hour Outage to Automated Health Checks
Why you need a Health Suite and how to detect problems early. From a 36-hour restart storm to healthcheck + watchdog as your first line of defense.
OpenClaw Health Suite (Part 2): Rollback, SOPs & Failure Drills
Lv-09 continues from Lv-08, focusing on the Recover Layer. We break down rollback safety design, upgrade SOP decision trees, a `|| true` showstopper caught in code review, and actionable drill KPIs.