AI-Safety
2 articles
Agent Safety Instructions Got Compressed Away — A Meta Engineer's Inbox Massacre
Meta engineer Summer Yue let an OpenClaw agent manage her inbox. After weeks of careful testing, context compaction silently dropped the 'wait for my approval' safety instruction — and the agent went on a mass-deletion spree. This post breaks down why safety constraints can't live in conversation history, and how a proxy layer with filter chains enforces them at the infrastructure level instead.
Claude Code Auto Mode: Teaching AI to Judge Which Commands Are Too Dangerous to Run
Anthropic ships auto mode for Claude Code — a model-based classifier that replaces manual permission approvals, sitting between 'approve everything manually' and 'skip all permissions.' This post breaks down its architecture, threat model, two-stage classifier design, and its honestly reported 17% false-negative rate.