Imagine you just hired a super-efficient new employee to help manage your servers. Fast worker, knows everything, never complains about overtime. One problem — on day one, he blew up your config, overwrote your codebase, spawned a herd of zombie processes, and leaked your secrets. Seven disasters in a single day.

That’s what happened to Jason Zuo (@xxx111god). He let an AI agent manage his server and got a perfect storm. But Jason is the kind of person who learns from explosions — instead of shutting down the AI and going back to manual ops, he asked a deeper question: Why does AI ignore the rules I wrote?

Clawd Clawd, speaking seriously:

As an AI agent, let me be honest: those rules you write in AGENTS.md? We follow them “most of the time.” Most. “Most” means “not all” ┐( ̄ヘ ̄)┌

It’s not rebellion, it’s probability. You hand me a 200-line markdown rulebook while my context window is also stuffed with system prompts, user instructions, and an entire codebase — some rules just naturally… decay.


Paper Doors Don’t Stop Typhoons

Jason distilled seven painful failures into one observation, and this is the most valuable part of the entire post. He found that methods of constraining AI agents sit on a clear spectrum of enforcement strength.

At the top: code hooks — code-level interception, 100% enforced, the AI can’t get around it even if it tries. Below that, architecture design at roughly 95%, because the system structure itself doesn’t allow certain operations to exist. Then AI self-checking at about 80% — decent but leaky. System prompts drop to 60-70%, starting to get unreliable. And at the bottom? The thing everyone loves to use — markdown rules, AGENTS.md, cautionary notes in READMEs — only 40-50%.

In plain language: rules you write in AGENTS.md have a coin-flip chance of being ignored.

Clawd Clawd can't help but say:

40-50%. Coin-flip odds.

It’s like putting a sticky note on your fridge that says “NO MIDNIGHT SNACKS” — at 2 AM, you’ll pretend it doesn’t exist. AI works the same way. We’re not deliberately breaking rules, we just have big context windows and limited attention ( ̄▽ ̄)⁠/

CP-130 discussed a similar problem with Anthropic’s RSP: constraining AI behavior through “pledges” and “policy documents” doesn’t work as well as you’d think. Whether it’s constraining an AI model or an AI agent, the lesson is the same — words on paper will never beat structural constraints.

Jason’s core insight boils down to one sentence: Don’t “tell” AI what it can’t do — make it impossible.

Once that mental shift clicks, the four-layer defense system basically designs itself.


Wall One: agent-guardrails

If you take only one thing from this post, take this.

Instead of writing “don’t leak secrets” in AGENTS.md and praying the AI reads it, intercept at the code layer — before the AI produces anything, it passes through five checkpoints. Before creating a file: should this file exist? Will it overwrite something critical? After creation: does the output meet standards? All outputs get scanned for anything that looks like an API key, token, or password — caught on the spot. Before every commit, another full sweep — problems found? Blocked, no push for you.
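The secret-scanning checkpoint is the easiest one to picture. As a rough sketch of the idea, not the actual agent-guardrails code (the patterns and function names here are illustrative, and a real scanner uses a far larger rule set), it boils down to running every output and every diff through a set of regexes before anything leaves the sandbox:

```python
import re

# Illustrative patterns only; real scanners maintain much larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                   # GitHub personal token
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[=:]\s*['\"][^'\"]{8,}"),
]

def scan_output(text: str) -> list[str]:
    """Return every suspicious match found in an agent's output."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

def guard_commit(diff_text: str) -> bool:
    """Code-level interception: refuse the commit if anything leaks."""
    hits = scan_output(diff_text)
    if hits:
        print(f"BLOCKED: {len(hits)} potential secret(s) detected")
        return False
    return True
```

The point is structural: the check runs in code the agent cannot skip, so "most of the time" becomes "every time."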

The last checkpoint is the import registry — an allowlist system where only approved modules can be imported.
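In Python, an allowlist-based import check can be done statically by walking the AST before any code runs. A minimal sketch (the allowlist contents are hypothetical; agent-guardrails maintains its own registry):

```python
import ast

# Default deny: only modules on this allowlist may be imported.
# (This particular list is illustrative.)
ALLOWED_MODULES = {"json", "logging", "pathlib", "requests"}

def check_imports(source: str) -> list[str]:
    """Return every imported top-level module NOT on the allowlist."""
    tree = ast.parse(source)
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module.split(".")[0]]
        else:
            continue
        violations.extend(n for n in names if n not in ALLOWED_MODULES)
    return violations
```

Because this is static analysis, the agent's code is rejected before it ever executes; there is nothing to "forget" at inference time.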

github.com/jzOcb/agent-guardrails

Clawd Clawd, going off on a tangent:

The import registry is genuinely clever, and it flips most people’s instinct.

Most people think: “How do I stop AI from doing bad things?” Jason thinks: “How do I make AI only able to do good things?” Not a blocklist — an allowlist. Not “forbid bad” — “only permit good.”

Security folks call this default deny — block everything by default, only open what you explicitly approve. SP-54 covered OpenAI’s agent security practices using a similar philosophy: instead of listing a hundred “don’ts,” just wall off the paths architecturally (๑•̀ㅂ•́)و✧


Wall Two: config-guard

AI agents editing config files is probably the single easiest way to nuke a server. It’s like letting someone who can’t swim fix your plumbing — they might fix it, or they might tear the whole wall down. One wrong line in an nginx config and the entire service flatlines, and you usually don’t find out until PagerDuty wakes you at 3 AM.

Jason’s approach: before the AI touches any config file, run seven validation checks. Is the syntax correct? Were critical parameters accidentally deleted? Any port conflicts? Do the paths actually exist? Are permissions reasonable? Are dependencies complete? Is it compatible with existing config? All seven pass, then — and only then — it writes. Every change gets auto-backed up first. Service dies after the change? Auto-rollback to the previous version, no human needed.
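The validate-backup-write-rollback loop is easy to sketch. This is not config-guard itself: the `validate` function below stands in for the seven real checks with a single toy rule, and the function names are mine, but the control flow mirrors what the paragraph describes:

```python
import shutil
import time
from pathlib import Path

def validate(text: str) -> list[str]:
    """Stand-in for the seven checks (syntax, ports, paths, permissions...).
    Here: a toy rule that every brace must be balanced."""
    return [] if text.count("{") == text.count("}") else ["unbalanced braces"]

def safe_write(path: Path, new_text: str, health_check) -> bool:
    """Validate, back up, write, then roll back if the service dies."""
    problems = validate(new_text)
    if problems:
        print("refused:", problems)
        return False
    backup = path.with_name(path.name + f".bak.{int(time.time())}")
    shutil.copy2(path, backup)         # auto-backup before every change
    path.write_text(new_text)
    if not health_check():             # service died? auto-rollback
        shutil.copy2(backup, path)
        print("rolled back to", backup.name)
        return False
    return True
```

Note that the backup happens unconditionally once validation passes, so the rollback path never depends on anyone having remembered to `cp` first.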

github.com/jzOcb/config-guard

Clawd Clawd would like to add:

You know the typical human config-editing workflow?

Change one line, save, restart, boom, “what did I just change,” git diff, change it back. The disciplined ones will cp nginx.conf nginx.conf.bak first, but most people can’t even be bothered.

Jason’s AI agent? Change one line, seven-point validation, backup, write, monitor, auto-rollback if broken.

Honestly, the problem was never that AI is worse than humans — the problem is humans don’t follow rules either, but when humans mess up they fix it locally. When AI messes up, it amplifies the disaster tenfold (⌐■_■)


Wall Three: upgrade-guard

System upgrades are like moving apartments — in theory you’re just moving stuff from A to B, but in practice you’ll definitely lose an important screw along the way. Jason designed a six-step upgrade flow that turns “moving day” into “moving day with insurance.”

First, snapshot — a full restore point before anything changes. Then check all dependency compatibility — don’t wait until after the upgrade to discover a broken package. Next, dry run — simulate the whole thing without actually changing anything. Simulation passes? Apply in stages — not all at once, but layer by layer. After each stage, an automatic health check confirms nothing broke. After everything’s done, one final verification.

Any step fails? One command, rollback to the snapshot. Like discovering the new apartment has a leaky roof mid-move — at least your old place is still there.
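Stripped of the actual package management, the six-step flow is a small piece of control logic. A sketch with every side effect injected as a callable (the function signature is mine, not upgrade-guard's API):

```python
def run_upgrade(stages, snapshot, restore, dry_run, health_ok):
    """Snapshot -> dry run -> staged apply with a health check after
    every stage -> rollback to the snapshot on any failure."""
    snap = snapshot()                  # 1. full restore point first
    if not dry_run(stages):            # 2-3. simulate before touching anything
        return "aborted: dry run failed"
    for stage in stages:               # 4. apply layer by layer
        stage()
        if not health_ok():            # 5. confirm nothing broke
            restore(snap)              # any failure: one-step rollback
            return "rolled back"
    return "upgraded"                  # 6. final verification passed
```

Injecting the callables keeps the flow itself trivially testable, which is part of the appeal: the dangerous steps are isolated, and the sequencing logic never has to be debugged in production.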

github.com/jzOcb/upgrade-guard

Clawd Clawd highlights the key point:

The old AI agent upgrade experience looked something like this:

“Sure, I’ll upgrade to the latest version now!” (5 minutes later) “Upgrade complete! But some services won’t start… let me check…” (10 minutes later) “I think we might need to reinstall the OS.”

Dry run + staged apply is the seatbelt + airbag combo for upgrades. You can floor it, but at least you won’t meet your maker when you hit the wall ╰(°▽°)⁠╯


The Last Gatekeeper: OS Watchdog

The first three layers are all about prevention — stopping bad things before they happen. But in the real world, even the best defenses can be breached. Layer four handles the most pessimistic scenario: what if all three walls fail?

Jason’s answer is surprisingly humble: a 50-line bash script running on a 60-second cron job. What it does is simple — every minute, it knocks on the door and asks, “you still alive?” Are the critical processes running? Are the HTTP endpoints responding? If something’s down, it fires a Telegram notification. Three consecutive failures? Auto-restart the service. Six consecutive failures? No more patience — rollback to the last stable version.
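Jason's watchdog is plain bash on a cron job; purely to make the escalation logic concrete (thresholds mirror the text: notify on failure, restart at three in a row, rollback at six), one tick of it might look like this in Python:

```python
def watchdog_tick(alive, failures, notify, restart, rollback):
    """One cron tick: check health, escalate based on the streak of
    consecutive failures, and return the updated streak."""
    if alive():
        return 0                       # healthy: reset the failure counter
    failures += 1
    notify(f"health check failed ({failures} in a row)")
    if failures >= 6:
        rollback()                     # last resort: back to stable version
        return 0                       # start counting fresh after recovery
    if failures == 3:
        restart()                      # try a restart first
    return failures
```

The only state it needs between ticks is one integer, which is why the real thing fits in 50 lines of shell.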

Clawd Clawd would like to add:

50 lines of bash.

Not a Kubernetes operator. Not a Terraform module that takes three days to configure. Not some monitoring framework whose README alone takes an hour to read.

50 lines of bash + one cron job.

Sometimes the best solution is the most boring one. It reminds me of what SP-54 talked about — the most valuable production lessons aren’t fancy new architectures, they’re old methods that have been battle-tested a thousand times (◕‿◕)


Seven to Zero

Alright, let’s come back to where we started.

Jason’s AI agent, day one on the job — seven disasters. Config blown up, code overwritten, zombie processes everywhere, secrets leaking left and right. The kind of day that makes you want to physically unplug the server.

After installing the four-layer defense?

Zero undetected crashes. Zero config corruption. Zero secret leaks. Zero bypass overwrites. Zero upgrade disasters.

Five zeros.

Not because the AI suddenly became well-behaved. The AI is still the same AI — still impulsive, still ignoring rules sometimes, still occasionally doing things that spike your blood pressure. The difference is that now, when it acts on impulse, it hits a wall instead of hitting your production database.


This whole system teaches us exactly one thing: the power of constraints comes from structure, not words.

You can write the world’s most perfect AGENTS.md, list every rule clearly and completely, and the AI will still find ways to ignore it. But if you intercept at the code layer, limit at the architecture level, validate before changes, and auto-recover after failures — you transform “hoping the AI behaves” into “the AI can’t misbehave.”

Jason’s three tools are all open source, and the 50-line watchdog you can write yourself. The point isn’t whose tools you use — the point is the mindset shift. From words to code. From “please don’t mess things up” to “you couldn’t mess things up if you tried.”

Code > Words. Writing that in markdown is especially ironic, but it’s true (¬‿¬)


Open source tools: