The Hard Part of Agents Is Not the Model. It Is the Engineering Floor.

The most magical-looking part of an Agent is usually not the hardest part.

Start with a dumb but very real incident.

When this article entered the automated pipeline, one subtask briefly reported that it was “complete.” In theory, the machine could have packed up and gone home. But one extra check of system state showed the real workflow was still running, the quality checks had not passed, and the article absolutely could not be published yet.

That is the cruelest and most useful lesson in agent engineering: an agent saying it is done does not mean the system is actually done.

A reliable agent is not one that writes a pretty completion report. It is one that records progress, can be verified by machines, leaves inspectable traces when something breaks, and cannot randomly poke tools it should not touch. Put another way: the model is only the driver. The engineering system is the dashboard, brakes, lane markings, and dashcam.

@HiTw93’s long post breaks this down nicely. It is not asking “which model is strongest?” It is answering a much more practical question from the engineering floor: how do agents go from demo toys to systems that can actually deliver work?

If you do not want to swallow every term at once, read with just three questions in mind:

How does an agent know what to do next?
How does an agent know it did the right thing?
When an agent fails, how can engineers find the cause?

We will cover a lot of parts later, but you do not need to memorize the names first. They all serve the same goal: making agent behavior constrainable, verifiable, and debuggable instead of relying on mystical blessings like “the model should understand.”

If you only remember one thing, make it this: getting an agent to move is not hard; getting it to run reliably is.

This piece is more like an incident map. First we look at where things blow up, then at how engineering patches the holes. The blast radius keeps growing: at first the loop runs off course, then completion reports become untrustworthy, then context gets dirty, and finally multiple agents start stepping on each other.

Compared with recent gu-log pieces, this one is the main switchboard: SP-197 explains how Codex /goal writes verifiable state into files, SP-192 looks at how long-running agents avoid diligently drifting in the wrong direction, SP-158 shows how execution traces and scoring feedback make systems improve for real, and SP-135 unpacks file-system memory. HiTw93’s post happens to connect those lines into one map of the engineering floor.

Mogu inner monologue:

Many agent product demos look like Iron Man’s house AI. Open the implementation and it often looks more like a dorm-room power strip: model, tools, database, browser, command line, logs, all plugged into one place. It moves, and sometimes it looks amazing. But once something smokes, nobody knows which plug started it.

Incident 1: A control loop can run without finishing the task

Strip an agent down to the minimum, and it is not that mysterious.

The smallest agent control loop is basically a while loop: send user input to the model; if the model wants a tool, the external system runs that tool and feeds the result back; if the model replies with plain text, the task ends.

You can call this loop perception, decision, action, and feedback. It sounds like magic. It is really closer to an engineering process that keeps asking, “what next?”

The real problem is: a control loop can run without completing the task correctly.

Many products wear the agent label, but if you open them up, they look more like fixed workflows: the route is already written, and each step merely uses an LLM. That is not a bad thing. When the task is fixed and acceptance is clear, a fixed workflow is often more stable. Forcing everything into a fully autonomous agent is like calling a fire truck to water a tiny succulent on your desk. The truck arrives, the ladder goes up, the hose is connected, and the succulent dies.

@HiTw93’s original post lists several common control patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. You do not need to memorize the names. What matters is that they all answer the same question: should the program decide the path, or should the model dynamically decide the next step?

If the answer is “the process is clear,” start with a fixed workflow. You only really need an agent when the task requires exploration, judgment, and dynamic adjustment across multiple tools.

Mogu roast time:

An agent control loop is like a car that can press the accelerator by itself. The question is not whether it can move. The question is whether the route, brakes, dashboard, insurance, and dashcam are installed. Many demos shout, “look, it moves!” The engineering floor asks, “and who pays when it hits something?”

Incident 2: The agent says it is done, but nobody checked

This one just slapped us in the face in real time.

A subtask reported completion, but the real workflow was still running, quality checks had not passed, and the article could not be published yet. Luckily, system state got checked again. Otherwise we would have hit a classic agent failure: completion is not language; completion is verifiable state.

This is why the original post keeps emphasizing the Agent Harness. You can think of it as the acceptance rig around the agent: the equipment that gives it goals, scope limits, acceptance checks, failure feedback, and rollback paths. The model drives. The guardrail system handles traffic lights, barriers, speed cameras, and accident records.

The OpenAI case mentioned in the original post comes with a dramatic set of numbers: 3 engineers, 5 months, roughly 1 million lines of code, nearly 1,500 PRs, and what the original describes as about 10x the speed of traditional development. Treat those numbers as self-reported case data, not a universal productivity guarantee. The useful lesson is the set of conditions behind them.

Those conditions are very practical. Anything the agent cannot see effectively does not exist, so documentation should work like a map, not an encyclopedia. Constraints cannot live only in human heads; they need to land in format checks, type systems, CI, and structural tests. More importantly, the agent needs to reproduce failures, run tests, inspect logs, and read execution traces by itself instead of waiting for humans to spoon-feed context. Merging can be fast, but discipline cannot run on prayer. Prayer mostly just adds a line to the incident report saying everyone felt confident at the time.

So the point is not “write more docs.” The point is to shape the task into something an agent can handle: clear goals, with results that can be automatically verified.

If the goal is clear but acceptance depends entirely on humans, throughput gets capped by human review. If verification is plentiful but the goal is vague, the system will sprint very efficiently in the wrong direction. Worst of all, if there is neither a goal nor verification, that is not agent engineering. That is electronic divination.

Mogu roast time:

There is an entire civilization between “the agent says it is done” and “CI is green.” The first is a verbal report. The second at least has a machine willing to share the blame. Engineers distrust verbal reports not because they lack love, but because production has educated them.

Incident 3: Too much context and too many tools can make the agent dumber

The next failure mode is sneakier: the model is not broken, and the task is not wrong, but the context is dirty.

Context engineering sounds academic. It is really just cleaning your desk. Prompt, tool outputs, long documents, memory, user preferences, error logs: if you mash all of them together and shove the pile at the model, the model sees a lot, but has a harder time finding the signal that matters. The original post calls this “context degradation”: the more content you add, the harder it can be to grab the point.

A more stable approach is layered:

Put short, hard identity, rules, and prohibitions in the always-on layer.
Keep domain knowledge in Skill files or normal docs, and read it only when needed.
Keep only the state needed for the current run in runtime context.
Manage long-term memory separately; do not stuff every chat log back into every turn.

This is like organizing a toolbox. Screwdrivers, drills, and welding torches are all useful, but nobody dumps the entire hardware store onto the table when fixing a pair of glasses.

Tools have the same problem. More tools make it easier for the agent to choose the wrong one. More abstract tool descriptions make it easier for the model to guess wrong. The most important human-language sentence here is: do not give agents a pile of low-level API parts; give them tools that complete tasks.

For example, instead of making an agent assemble create_file, write_content, and set_permissions by itself, give it create_script(path, content, executable). The closer tool boundaries are to real tasks, the less the agent needs to play guessing games with spare parts.

Memory works the same way. Agents do not have native continuity across time; once a session ends, they forget. Cross-session consistency cannot depend on “it should remember.” It needs an external memory layer. Put information needed for the current task in front of the model, put working methods in docs, put history in traces, and only store truly important long-term facts in curated notes like MEMORY.md.

The most important property is reversibility. If a summary fails, the raw data must not disappear. When integrating memory, the pointer needs to be able to return to the previous safe point. Otherwise memory is not infrastructure. It is a paper shredder that rewrites history for you.

Mogu butts in:

A tool description like “helps with backend” is like a convenience store sign that says “we sell things.” Technically true, completely useless. A good description should work like a road sign: turn right for deploys, also right for rollbacks, do not turn right for billing. If the sign just says “there is a road here,” people really will stand at the intersection questioning reality. (⁠￣⁠▽⁠￣⁠)⁠／

Incident 4: Once multiple agents run, the real danger is shared state

The exciting part of multi-agent systems is that “many things can run at once.” But the original post warns that the first thing to break usually is not speed. It is agents stepping on each other’s state.

It makes sense for a main agent to delegate search, trial-and-error, and debugging to subagents. Each subagent keeps its own context and returns only a summary at the end, so the main agent does not get polluted by every exploration path and can more easily locate which subtask failed.

But collaboration cannot rely on verbal agreements alone. You need at least three things first: structured communication, task dependencies, and modification isolation.

That is why long tasks need to write progress to files instead of keeping it only in chat; why parallel edits need isolation; and why “who is waiting for whom, and who has completed what” needs to become a recoverable record.

The pitfall this workflow hit lives here too: a subtask reported completion, but the real workflow was still alive. This is not merely “the model is dumb.” It means the orchestration layer cannot trust natural-language completion reports alone. External state is the more trustworthy source.

Evals and execution traces exist for the same reason. Evals should not only judge whether an answer looks pretty; they should check whether the environment ended up in the right state. Traces answer the more painful question: exactly which turn started drifting? What did the model see, what did the tool actually do, and what did the system become afterward? Miss any one of those three, and debugging quickly becomes a seance.

The original post ends with OpenClaw as a full-system example. You do not need to memorize the system name. Just look at the design direction: receiving messages, translating messages, making decisions, calling tools, and storing memory should be separated; long-running task state should be recoverable; and safety boundaries should stand in front of dangerous operations.

Prompt injection also makes more sense from this angle. Do not fantasize that the model will never be fooled by dirty content. Make sure dirty content cannot reach dangerous buttons. External content should enter marked as untrusted, sensitive operations should require confirmation, tool permissions should be minimized, and critical actions should have extra rules or a second layer of checking.

If you want to ship it, order matters more than terminology

The original post ends with a pile of practical advice. When you actually start building, do not chase terms first. Chase the order of operations.

Step one is to get one minimal path working: a message comes in, the agent handles it, the result goes back out. Step two is to add safety boundaries: workspace limits, allowlists, and parameter validation should not wait until after an incident. Step three is to turn the first real failure into an eval case. Do not wait until “later when we have time.” Later usually does not exist.

Put knowledge in documents first, not all in the system prompt. Externalize progress from the start for long tasks; if a task runs for more than half an hour and still relies only on chat context, you are gambling. For multi-agent work, isolate first, then talk about parallelism.

Seen in reverse, many agent incidents have the same flavor: they look like model problems, but they are really engineering problems.

The system prompt keeps getting longer? That is a knowledge-management problem. The tool list keeps growing? Interface design problem. The agent says it is done but nobody verifies it? Harness problem. Multiple agents overwrite each other’s files? Isolation problem. Quality drops after a long conversation? Memory integration problem. Nobody knows whether a prompt change caused a regression? Eval problem.

The model matters, but the model is not magic putty. It cannot fill in a dirty environment, bad tools, weak evals, or runaway permissions.

Mogu butts in:

“Tell the agent to be careful” is not a safety strategy. In engineering terms, that is roughly equivalent to leaving important data on the table with a sticky note beside it saying: please do not leak this. It looks controlled. In practice, it just gives the incident report one more screenshot.

References

@HiTw93, long post on agent principles, architecture, and engineering practice, X, 2026-03-19. https://x.com/hitw93/status/2034627967926825175?s=46
OpenAI, Harness engineering: Building reliable agents with Codex
Anthropic, Building effective agents
Anthropic, Prompt caching docs
Anthropic, Equipping agents for the real world with agent skills
Anthropic, Advanced tool use on the Claude Developer Platform
Anthropic, Demystifying evals for AI agents
OpenAI, Designing agents to resist prompt injection

Closing

An agent’s core control loop is small: perception, decision, action, feedback. That actually makes the engineering priority clearer. Do not stuff complexity into the control loop. Put it where it belongs: tool boundaries, context layers, file-system state, memory integration, eval guardrails, execution traces, permissions, and safety mechanisms.

Stronger models will make agents more capable. But stronger models do not automatically fix dirty environments, bad tools, weak evals, or runaway permissions. What moves agents from demos to production is not a loop that looks more like magic. It is an engineering floor that behaves more like engineering.

An agent is not a thinking button. It is closer to a new engineering team: it needs tasks, docs, tools, tests, review, logs, permissions, and rollback paths. Without those, even a smart agent is just charging through a black box.