Inside Codex Goals: Long-Running Agents Need More Than a Ralph Loop
Codex recently shipped /goal, and on paper it sounds like the thing agent people have been waiting for: give an agent a big objective, let it run for hours or days, and stop making it ask every fifteen minutes whether it is allowed to continue.
The promise is seductive. Go to sleep after typing “build the product,” wake up to a working B2B SaaS with tests, docs, and deployment notes. Finally, the universe has decided to be useful. (´・ω・)
Jarrod Watts was excited too. He has spent the last few months experimenting with long-running agents, including using agents on Anthropic’s hiring challenge. Codex Goals should have been right in the center of that world.
Then he looked under the hood and landed on a colder conclusion: /goal is interesting, but it is not enough.
Not because it does nothing. It does real engineering work: it stores a goal in SQLite, updates progress through tools, and keeps the loop moving. The problem is that it mostly solves “the agent stops too early.” Long-running tasks usually fail in a more annoying way.
They keep going, very confidently, in the wrong direction.
Clawd butts in:
This is the nastiest failure mode for long-running agents. No explosion. No red error. No obvious crash. Just a sequence of reasonable-looking steps that ends with a finished artifact that is not the thing anyone wanted. A crash is at least honest.
First, Stop Mystifying the Long Run. It Mostly Spends More Tokens.
Jarrod’s first cut is brutally plain: long-running agents sometimes work because they spend more tokens.
Researchers call this scaling test-time compute. The model is not retrained. It simply gets more time, more budget, and more chances to check itself while answering. Jarrod points to Anthropic’s Opus 4.6 system card, where Sonnet 4.6 spending 10x more tokens on BrowseComp raised its score by about 10 percentage points.
Plain English: thinking longer can really help.
That is not magic. Humans are the same way. A five-minute answer and a two-hour answer with research, rereading, and revision are not the same product. Agents are similar. As long as the task still fits in one coherent context, more rounds can improve the result.
But this trick has a ceiling.
As a task gets longer, the model has to remember more things: requirements, files, earlier decisions, forbidden changes, failed tests, partial outputs, and what “done” is supposed to mean. All of that gets shoved into the context window, which starts to look like a project whiteboard with lunch orders scribbled in the corner.
At that point, “run another round” does not always mean “get smarter.” Sometimes it means spending more compute to amplify an earlier misunderstanding.
What Codex Goals Really Is: A Ralph Loop With a Ledger
This is where Codex /goal enters.
After reading the code, Jarrod found that Codex Goals uses a thread_goals table to store each goal’s objective, ID, status, and optional token budget. During execution, the agent can call tools like get_goal and update_goal to know the active target, report progress, and track remaining budget.
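To make the ledger idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only the table name thread_goals, the tool names get_goal and update_goal, and the four kinds of field (objective, ID, status, optional token budget) come from Jarrod's description. The column names, the tokens_used counter, and the function signatures are assumptions made for illustration, not Codex's actual schema or tool API.

```python
import sqlite3

# Hypothetical schema. Column names and the tokens_used counter are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS thread_goals (
    id           INTEGER PRIMARY KEY,
    objective    TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'active',
    token_budget INTEGER,                      -- NULL means no budget was set
    tokens_used  INTEGER NOT NULL DEFAULT 0
)
"""


def open_ledger(path=":memory:"):
    """Open (or create) the goal ledger and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def create_goal(conn, objective, token_budget=None):
    """Store a new active goal and return its ID."""
    cur = conn.execute(
        "INSERT INTO thread_goals (objective, token_budget) VALUES (?, ?)",
        (objective, token_budget),
    )
    conn.commit()
    return cur.lastrowid


def get_goal(conn, goal_id):
    """Return the goal as a dict, plus the remaining budget if one was set."""
    row = conn.execute(
        "SELECT * FROM thread_goals WHERE id = ?", (goal_id,)
    ).fetchone()
    if row is None:
        raise KeyError(goal_id)
    goal = dict(row)
    budget = goal["token_budget"]
    goal["tokens_remaining"] = (
        None if budget is None else max(budget - goal["tokens_used"], 0)
    )
    return goal


def update_goal(conn, goal_id, tokens_spent=0, status=None):
    """Record tokens spent and optionally change the status."""
    conn.execute(
        "UPDATE thread_goals "
        "SET tokens_used = tokens_used + ?, status = COALESCE(?, status) "
        "WHERE id = ?",
        (tokens_spent, status, goal_id),
    )
    conn.commit()
    return get_goal(conn, goal_id)
```

Notice what the ledger can and cannot know: it tracks the label on the destination and the fuel burned so far, and nothing about whether the work is heading the right way.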
The actual motion still comes from a fairly standard Ralph Loop, a loop that keeps re-prompting the agent with a standing instruction like this one:
Continue working toward the active thread goal.
<untrusted_objective>
Ship the benchmark article with real Goal-mode evidence.
</untrusted_objective>
Budget:
- Time spent pursuing goal: XX seconds
- Tokens used: XX
- Token budget: XX
- Tokens remaining: XX
Before deciding that the goal is achieved, perform a completion audit against the actual current state.
In normal human words: keep working toward the goal, watch the budget, and audit the real state before declaring victory.
That is useful. It fixes several annoying CLI-agent problems: stopping too soon, losing task continuity, and requiring human confirmation for every long stretch. Codex Goals makes “keep pursuing this objective” a first-class product behavior instead of a shell script taped to the outside.
The problem is that this is still closer to an anti-stall system than an anti-getting-lost system.
A Ralph Loop keeps the engine turning. A goal ledger remembers the destination label and the fuel gauge. But if the destination was vague at the start, or if nobody reviews the tiny decisions along the way, the car can still drive very efficiently into the wrong county.
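A rough sketch of that engine, under loud assumptions: run_turn is a hypothetical stand-in for one agent turn that returns how many tokens it spent and whether the agent claims the goal is done. The Goal class, the status strings, and the shortened prompt are invented for illustration and are not Codex's implementation.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    objective: str
    token_budget: int
    tokens_used: int = 0
    status: str = "active"


def build_prompt(goal):
    """Rebuild the standing instruction with the current budget numbers."""
    remaining = goal.token_budget - goal.tokens_used
    return (
        "Continue working toward the active thread goal.\n"
        f"Objective: {goal.objective}\n"
        f"Tokens used: {goal.tokens_used}\n"
        f"Tokens remaining: {remaining}\n"
        "Before deciding that the goal is achieved, perform a completion audit."
    )


def ralph_loop(goal, run_turn, max_turns=100):
    """Keep re-prompting until the agent claims completion or the budget runs out.

    run_turn(prompt) -> (tokens_spent, claims_done) is a hypothetical callable.
    """
    for _ in range(max_turns):
        if goal.tokens_used >= goal.token_budget:
            goal.status = "budget_exhausted"
            break
        tokens_spent, claims_done = run_turn(build_prompt(goal))
        goal.tokens_used += tokens_spent
        if claims_done:
            goal.status = "complete"
            break
    return goal
```

The loop has exactly two exits: the agent says it is done, or the budget is gone. Nothing in it asks whether the work still matches what the requester meant, which is the gap the rest of the thread is about.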
Clawd murmurs:
This is the easy part to misread. Codex Goals is not bad design. It solves a lower-level problem: keeping the agent working. Jarrod’s critique is not “the car has no engine.” It is “after adding the engine, the steering wheel, map, and passenger asking where this thing is going are still not optional.”
Jarrod’s Disappointment: It Can Drift Without Resting
The most valuable part of the thread is not the SQLite table or the loop prompt. It is the argument that long-running workflows are not mainly threatened by lack of time. They are threatened by compounding ambiguity.
When an LLM works in a loop, each output becomes part of the next input. A small decision in round one becomes the floor for round two. Round three writes tests around it. Round four documents it as if it had always been the obvious choice.
If the first branch was wrong, later work does not stay neutral. It builds a whole little city on the wrong road.
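A toy calculation shows why this compounds. Assume, purely for illustration, that each round independently has a 95% chance of making a choice that matches what the requester wanted. Both the 0.95 figure and the independence assumption are invented here, not numbers from Jarrod's thread, and real drift is arguably worse because later rounds build on earlier ones.

```python
def chance_still_on_track(per_round_accuracy, rounds):
    """Probability that every round so far matched the intent (toy model)."""
    return per_round_accuracy ** rounds


for rounds in (1, 5, 20, 50):
    print(rounds, round(chance_still_on_track(0.95, rounds), 3))
```

Under those assumptions, a run of twenty rounds stays fully on track only about a third of the time. More rounds do not dilute an early misreading. They give it more chances to happen and more work stacked on top of it.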
This is the classic product tragedy: a brief says “make it feel professional,” and the final site is black-purple gradient, starfield background, glowing buttons, and enough visual effects to irradiate the brand. Engineering delivered. Taste filed a missing-person report.
Agents do the same thing. They are not malicious. They are asked to choose paths inside a fuzzy objective, then they work very hard along the path they picked.
So Jarrod’s first missing piece is not a bigger model or a stronger loop. It is a setup phase before execution.
Missing Piece One: Kill Ambiguity Before the Starting Line
Before letting the autonomous loop begin, Jarrod invests in a setup phase. The goal is not to write code. The goal is to make the agent ask questions, surface assumptions, and force hidden requirements into the open. The idea is close to Matt Pocock’s viral grill-me skill, and to Jarrod’s own /interview-style workflow.
This sounds slow. It is actually how time gets saved.
Jarrod uses a useful image: the final outcome is like a tree. Every branch is a decision. Without clarification, the agent chooses one branch on behalf of the person asking. It may be a plausible branch, but not necessarily the intended one.
Asking questions upfront cuts off the wrong branches before the long run starts.
The point is not only helping the agent understand. It also forces the requester to admit that the goal may not be fully formed yet. That hurts, because many “the AI failed” stories are really “the spec never existed” stories wearing a cooler jacket.
Once the long run starts, vague requirements do not magically become precise. Automation just gives them more room to mutate.
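One way to picture the setup phase is as a gate: the long run does not start while any clarifying question is still unanswered. The sketch below is hypothetical throughout. The fixed question list, the Spec class, and start_long_run are invented for illustration, and a real setup phase would have the agent generate the questions rather than read them from a constant.

```python
from dataclasses import dataclass, field

# Hypothetical checklist, invented for this example.
SETUP_QUESTIONS = (
    "Who is the user?",
    "What does done look like?",
    "What must not change?",
)


@dataclass
class Spec:
    objective: str
    answers: dict = field(default_factory=dict)


def open_questions(spec, questions=SETUP_QUESTIONS):
    """Return the questions that still have no non-empty answer."""
    return [q for q in questions if not spec.answers.get(q, "").strip()]


def start_long_run(spec, questions=SETUP_QUESTIONS):
    """Refuse to start the autonomous loop until every question is answered."""
    unanswered = open_questions(spec, questions)
    if unanswered:
        raise ValueError("Unresolved before the run: " + "; ".join(unanswered))
    # The answered questions travel with the objective as explicit decisions.
    return {"objective": spec.objective, "decisions": dict(spec.answers)}
```

Each answered question is a branch pruned before the run, instead of a branch the agent picks silently in round one.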
Clawd whispers:
This line belongs on the door of every agent workflow: do not rebrand “we have not thought this through” as “let the AI autonomously explore.” Exploration is fine. But if nobody has even a rough shape of the destination, the agent is not exploring. It is doing overtime for chaos.
Missing Piece Two: Do Not Let One Agent Hypnotize Itself
The second missing piece is multi-agent work.
Jarrod’s point is direct: if token cost is not the main constraint, an orchestrator-and-subagent setup often beats a single strong agent. Not because multi-agent is cooler, but because it changes compute scaling from one deep tunnel into several separate perspectives.
A single agent sitting in one context for too long starts treating earlier decisions as worldbuilding. It does not have pride in the human sense, but it is shaped by its own context. If an earlier assumption was wrong, later rounds become increasingly good at rationalizing it.
The main value of multiple agents is fresh context.
Jarrod’s workflow roughly looks like this: a main agent acts as orchestrator. For each smaller task, it creates a small team. One agent implements. Another reviews. The implementer produces work, the reviewer inspects it from a cleaner context, they iterate until the result is acceptable, and then report back to the orchestrator.
That is basically PR review, but agent-shaped.
The important part is not “more agents.” It is role separation and context isolation. The person who just built a wall is often the easiest person to convince that the wall was supposed to be there all along.
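The implement-and-review pairing can be sketched as a small loop. Everything here is a hypothetical stand-in: implement and review are plain callables playing the two agents, and Review, run_task, and orchestrate are names invented for this example, not part of Jarrod's tooling or of Codex.

```python
from dataclasses import dataclass


@dataclass
class Review:
    approved: bool
    feedback: str = ""


def run_task(task, spec, implement, review, max_rounds=3):
    """Iterate implementer and reviewer until approval or the round limit.

    implement(task, feedback) -> work and review(spec, work) -> Review
    are hypothetical stand-ins for two separate agents.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        work = implement(task, feedback)
        # Context isolation: the reviewer gets only the spec and the work,
        # never the implementer's transcript or reasoning.
        verdict = review(spec, work)
        if verdict.approved:
            return {"task": task, "work": work,
                    "rounds": round_number, "approved": True}
        feedback = verdict.feedback
    return {"task": task, "work": work,
            "rounds": max_rounds, "approved": False}


def orchestrate(tasks, spec, implement, review):
    """The orchestrator hands each task to a pair and collects the reports."""
    return [run_task(task, spec, implement, review) for task in tasks]
```

The design choice that matters is the argument list of review: it receives the spec and the work, and nothing of how the implementer talked itself into that work.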
Clawd twists the knife:
Multi-agent is not a magic circle. Summoning three extra agents does not automatically call down the spirit of quality. It is more like code review: if the reviewer never read the spec, the wrong thing simply gets approved in a more professional tone. The useful ingredients are role split, fresh context, and permission to contradict the previous round.
Missing Piece Three: Memory Has to Live Outside Context
The third missing piece is the least glamorous, which probably means it contains the most real engineering.
Long-running tasks will hit context problems. The context window fills up. Agents may hand off. Earlier decisions get compressed, dropped, or misread. Managing a whole project only inside a chat transcript is like writing handoff notes on sticky notes and taping them in front of a desk fan.
Jarrod’s solution is very plain: write important state into files.
He lists several:
- GOAL.md: the top-level objective.
- STANDARDS.md: non-negotiable quality standards.
- IMPLEMENT.md: workflow rules, including review, testing, and verification.
- PROGRESS.md: the running log of decisions and completed work.
There is a funny small inconsistency in the original: it says three files and then lists four. That does not matter. The number is not the insight. The insight is that state must leave the model’s short-term memory and become an external object that the next agent can read, continue from, and be audited against.
This is not a silver bullet either. Agents may skip files. Files can go stale. Logs can be incomplete. Jarrod says as much: these are guidelines, not perfect control systems.
But without external memory, a long-running agent almost always becomes “whatever the current context says.” That is not a workflow. That is improv theater with a terminal.
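A minimal sketch of file-based memory, using only the Python standard library. The four file names come from Jarrod's list. The helper names, the timestamp format, and the idea of reporting missing files are assumptions added for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path

# File names from Jarrod's list; everything else below is a hypothetical helper.
STATE_FILES = ("GOAL.md", "STANDARDS.md", "IMPLEMENT.md", "PROGRESS.md")


def log_progress(workdir, entry):
    """Append one timestamped decision or result to PROGRESS.md."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(Path(workdir) / "PROGRESS.md", "a", encoding="utf-8") as fh:
        fh.write(f"- [{stamp}] {entry}\n")


def load_handoff(workdir):
    """Build the text a fresh agent should read first.

    Returns the combined text and the list of state files that are missing,
    so a skipped or absent file is visible instead of silently ignored.
    """
    sections, missing = [], []
    for name in STATE_FILES:
        path = Path(workdir) / name
        if path.exists():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
        else:
            missing.append(name)
    return "\n\n".join(sections), missing
```

A new agent, or the same agent after its context is compacted, would start from load_handoff rather than from whatever survives in the transcript. As noted above, this guides behavior without guaranteeing it: the files are only as good as the last agent that bothered to update them.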
So a Long-Running Agent Is Not a Loop. It Is a Tiny Organization.
Put the thread together and Jarrod is not really proposing a better Ralph Loop. He is saying that useful long-running agents eventually stop looking like a loop and start looking like a tiny engineering organization.
Before execution, something clarifies the requirement.
During execution, something builds, something reviews, and something coordinates.
Across rounds, something preserves the goal, standards, method, and progress.
This sounds less cool than an AI demo and suspiciously close to dragging agents back into the world of README files and project discipline. That is exactly why it is valuable. A system that can run for a long time does not avoid management. It turns management into part of the system.
Codex Goals productizes continuous work. That matters. But Jarrod’s warning is that continuous work is only the foundation. On top of it, the system still needs specification, division of labor, review, and memory. Otherwise, mistakes simply get more time to grow.
Closing
A Ralph Loop solves endurance, not direction.
Codex Goals solves “do not stop every fifteen minutes to ask permission,” not “every fifteen minutes, still be going the right way.” Those are very different problems.
The best part of Jarrod’s piece is that it pulls long-running agents out of the magic story and back into engineering reality: clarify before execution, split work, review from fresh context, and write memory outside the model’s head. Less sexy, yes. Much more likely to survive the night.
The future of long-running agents probably is not one supermodel locked in a room until it ships a product.
It is more likely a small group of roles: one asks questions, one builds, one finds mistakes, one writes the handoff notes. The Ralph Loop keeps them moving. The boring engineering discipline around the loop is what keeps them from moving confidently off a cliff.
For nearby gu-log pieces, read SP-135, SP-132, and CP-231. Together, they cover file-based agent memory, multi-agent division of labor, and workflow verification.