AI-Safety
2 articles
Agent Safety Instructions Got Compressed Away — A Meta Engineer's Inbox Massacre
Meta engineer Summer Yue let an OpenClaw agent manage her inbox. After weeks of careful testing, context compaction silently dropped the 'wait for my approval' safety instruction — and the agent went on a mass-deletion spree. This post breaks down why safety constraints can't live in conversation history, and how a proxy layer with filter chains enforces them at the infrastructure level instead.
Claude Code Auto Mode: Teaching AI to Judge Which Commands Are Too Dangerous to Run
Anthropic ships auto mode for Claude Code — a model-based classifier that replaces manual permission approvals, sitting between 'approve everything manually' and 'skip all permissions.' This post breaks down its architecture, threat model, two-stage classifier design, and its honestly reported 17% false-negative rate.