Codex Is No Longer Just for Code — It Is Becoming an Operating System for Computer Work

At 3 PM, an engineer returns to a Codex conversation that has been sitting there for days.

The table is still full: the website from yesterday, the tests that already ran, the reviewer comment about tight spacing, someone waiting in Slack, and a small launch checklist that says “do not forget this again.” The engineer does not explain the whole project from zero. They only add: “Continue from yesterday, but fix the spacing in the side panel first.”

That is the interesting part of this Codex shift. The point is not only “the model writes code better.” Codex is starting to put computer work on one table: memory stays in the thread, tools plug in around it, artifacts sit beside the conversation, and automations can wake the same workflow later.

Older Codex felt like a coding helper inside an editor. This newer direction feels more like a workbench that keeps the scene intact. Code is still on the desk, but now the browser, terminal, documents, calendar, inbox, desktop GUI, review surface, and shared memory are also close enough to use.

Mogu wants to add:
This is easy to misread as “Codex added many features.” Sure, it did. But the real story is control. The human does not have to cut the job into ten tiny slips of paper for ten tools. Codex also does not wake up confused every time. The table stays there, the stuff stays there, and the next round can continue. Boring? Yes. Also where product work usually dies (⁠◕⁠‿⁠◕⁠)

Persistent threads: the table does not reset every day

The biggest waste in short chats is not that the model is dumb. It is that every session resets the table.

Yesterday’s naming rule must be explained again. Last week’s test trap gets stepped on again. The reviewer’s tone preference, the file that should not be touched, the email still waiting for a reply — all of it is trapped in old chat history. The human can scroll back, but the work itself did not really carry it forward.

A persistent thread turns a conversation into a long-lived workspace. It is not only a transcript. It is the work surface: documents, half-finished artifacts, open loops, and reasons for past decisions all stay there. When the user returns, Codex does not need to ask, “So what is the context?”

The best pinned threads are usually not one-off tasks like “fix this bug.” They are recurring workflows:

Chief-of-staff thread: messages, calendar, to-dos, and people who need replies.
Release thread: versions, tests, docs, and launch checklists.
Documentation-review thread: keep docs aligned with product changes.
External-monitoring thread: watch PR comments, document comments, and Slack replies.

Pinning and quick switching look like small interface details, but they reveal the product model: a thread is not a disposable sticky note. It is a fixed seat at the workbench.

Mogu whispers:
“Remembering the last sentence” is not impressive. A chatbot should do that. The useful part is remembering the work: who decided what, where the task got stuck, and which annoying question should never be asked again. Not romantic, but it saves a lot of “wait, didn’t we already agree on this?” moments.

Voice: putting rough work on the table

A lot of work does not begin as a clean instruction.

“I think someone mentioned this in Slack. I forgot the details. Go find it first.” Typed into a ticket, that sounds terrible. Said out loud, it feels normal. Humans often start work by dropping a messy situation on the table, then finding the edges.

Voice input is not mainly about saving a few seconds. It preserves the shape of rough thought. A two-minute spoken plan, a raw meeting transcript, or a half-finished question can contain hesitation, priority, emphasis, and the real point of friction. A polished prompt often deletes those things.

For an agent that can search, gather context, and report back, “I forgot the details; go look” is already enough to start. Voice puts unfinished work on the table. Codex can then fill in the missing context.

Steering, queuing, and mobile: the human can leave the desk, but not vanish

Once a long task starts running, the hard part is not waiting. The hard part is control.

Sometimes the direction is wrong right now. During a website review, the side panel shows a crowded layout, bad button copy, or a section in the wrong order. That needs steering: insert new direction into the task while it is still running. Do not wait until the whole thing finishes before discovering it drifted.

Sometimes the current direction is fine, but the next step is waiting. After the website is fixed, send the preview link to the reviewer in Slack. After the document is generated, export a PDF. After tests pass, prepare the launch checklist. That is queuing: keep the current work running, but put the next job beside it.

Steering controls what is happening now. Queuing controls what happens after this. Together, they keep the human in the loop without forcing them to stand beside the model until their eyes dry out.
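The difference between the two controls can be sketched as a toy model. This is only an illustration, not how Codex implements it: `ToyTask`, `steer`, `queue_next`, and `finish_current` are hypothetical names invented here. Steering changes the job that is running now; queuing adds a job that starts only after the current one ends.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class ToyTask:
    """Hypothetical model of one running job plus the jobs waiting behind it."""
    current: str
    notes: List[str] = field(default_factory=list)       # steering for the running job
    pending: Deque[str] = field(default_factory=deque)   # jobs queued for later

    def steer(self, direction: str) -> None:
        # Steering: changes what is happening now.
        self.notes.append(direction)

    def queue_next(self, job: str) -> None:
        # Queuing: changes what happens after the current job.
        self.pending.append(job)

    def finish_current(self) -> Optional[str]:
        # The current job ends; the next queued job (if any) becomes current.
        self.notes = []
        if not self.pending:
            self.current = ""
            return None
        self.current = self.pending.popleft()
        return self.current


task = ToyTask(current="fix the website layout")
task.steer("fix the spacing in the side panel first")
task.queue_next("send the preview link to the reviewer in Slack")
```

In this sketch the steering note attaches to the layout job immediately, while the Slack message waits until `finish_current()` is called.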

Mobile fits here too. The point of the Codex mobile app is not to squeeze a whole development environment into a phone. The point is that the human can leave the desk and still check progress, answer questions, approve the next step, or redirect the task. Files, permissions, environment variables, and repo state remain on the Mac. The phone pulls the human back only when a decision is needed.

Mogu highlights:
The worst expectation for mobile Codex is “write a full PR on a phone.” That is not productivity; that is a finger endurance test. A healthier role is: the local environment keeps running, and the human can approve, interrupt, or queue the next step while buying coffee. The phone is not the workbench. It is the bell on the workbench.

Tool radius: the workbench gets sockets

Threads answer “what does Codex remember?” The next question is: what can it touch?

The first ring is the web page. Codex does not only read HTML. It can inspect the rendered result, operate the page, and respond to annotations on the surface. Many UI problems only become obvious after someone sees the output. Reading files is not enough.

The second ring is the signed-in browser. Internal tools, SaaS admin panels, and flows that only open to a logged-in session can start to join the same table.

The third ring is the desktop. Old workflows with no API, no CLI, only windows and buttons, used to be human-only. Desktop operation makes those workflows describable and checkable too.

MCP servers and connectors are easiest to understand as safe sockets on the edge of the workbench. Slack, Gmail, and Calendar matter not because the names are trendy, but because many real tasks do not start as code. They start as messages, email, schedule conflicts, or one sentence from a human: “Can you look at this?”

When a workflow repeats, it can become a Skill. A Skill is not magic. It is a standard operating card taped beside the table. Do not teach the same routine to the agent every time; write down the working procedure so Codex can run it again.

Mogu highlights:
The line around Skills needs restraint. A daily workflow that always drops one step deserves a Skill. A weird task that happens once every three weeks and looks different every time probably does not. Organizing the workbench is good. Covering the whole desk with “maybe useful someday” rules is just another kind of mess.

Automations and goals: wake the thread, then know when to stop

A pinned thread is still passive. It waits for the human to return.

Thread automation is more like setting an alarm for the same table. Every few minutes or hours, wake the same thread, return to its existing context, and check the state. If the condition is not ready, wait. If it is ready, move to the next step.
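As a rough sketch of that alarm pattern, assuming a scheduler that is not shown here calls one function per wake-up: `ThreadState` and `wake` are hypothetical names, not a Codex API. The state object stands in for the context the thread keeps between wake-ups.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ThreadState:
    """Hypothetical stand-in for the context a thread keeps between wake-ups."""
    step: int = 0
    log: List[str] = field(default_factory=list)


def wake(state: ThreadState, is_ready: Callable[[], bool], steps: List[str]) -> str:
    """One wake-up: check the condition, then either wait or advance one step."""
    if state.step >= len(steps):
        return "done"
    if not is_ready():
        state.log.append("not ready, waiting")
        return "waiting"
    state.log.append("ran: " + steps[state.step])
    state.step += 1
    return "advanced"


steps = ["regenerate preview", "draft reply (do not send)"]
state = ThreadState()
results = [
    wake(state, lambda: False, steps),  # first alarm: condition not met yet
    wake(state, lambda: True, steps),   # second alarm: ready, run the first step
    wake(state, lambda: True, steps),   # third alarm: run the second step
    wake(state, lambda: True, steps),   # fourth alarm: nothing left to do
]
```

The design point is that `state` survives between calls, so each wake-up continues where the last one stopped instead of starting from zero.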

Some scheduled jobs should start fresh, like a daily report or a regular repo check. Other jobs should return to the same conversation because the context is part of the work. A chief-of-staff-style thread could scan messages and inboxes, find items that need attention, research a reply, and draft it without sending. When the human returns, the expensive context gathering is done. The authority to send still belongs to the human.

Feedback loops follow the same pattern: check PR comments, Google Doc comments, or Slack replies; regenerate the artifact; bring the review state back into the original thread. If the last step only exists through a desktop GUI, desktop automation covers that part.

But waking up is not enough. A long task also needs to know what “done” means. The point of Goals is not to make the task sound more ambitious. It is to give Codex a real finish line.

A weak goal is: “Implement this Markdown plan.” It sounds clear, but it has no parking space: nothing says where the work should stop. A strong goal has a verifier. Migrating an internal tool from Python to Rust is not “try to migrate it.” It is: the new implementation is done when the unit tests pass. The verifier might be a test suite, benchmark, bug reproduction, validation matrix, or an end-to-end workflow that must stay green.

A goal without a verifier is just a slogan on the table. Codex can work very hard, but effort is not the same thing as arrival.
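A minimal sketch of the idea, assuming the verifier can be expressed as a function that returns true or false. `run_goal`, `fix_one_test`, and `all_tests_pass` are hypothetical names for illustration; a real verifier would more likely be a test suite or benchmark command.

```python
from typing import Callable, Tuple


def run_goal(
    work_once: Callable[[], None],
    verifier: Callable[[], bool],
    max_rounds: int,
) -> Tuple[str, int]:
    """Repeat the work until the verifier passes or the round budget runs out."""
    for round_number in range(1, max_rounds + 1):
        work_once()
        if verifier():
            return ("done", round_number)
    return ("not done", max_rounds)


# Toy stand-in for a failing test suite that shrinks as work gets done.
failing_tests = {"count": 3}


def fix_one_test() -> None:
    failing_tests["count"] = max(0, failing_tests["count"] - 1)


def all_tests_pass() -> bool:
    return failing_tests["count"] == 0


outcome = run_goal(fix_one_test, all_tests_pass, max_rounds=5)
```

Without the `verifier` argument the loop could only count rounds, which measures effort. With it, the loop can report arrival.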

Mogu butts in:
Goals are easy to frame as the protagonist. A better reading: the goal is the finish line on the table. The work actually moves because threads, tools, schedules, side panels, and verifiers are wired together. A long task without a verifier is like a task title that says “get stronger.” Very inspiring. Still no idea which muscle to train.

The side panel: artifacts should stay within arm’s reach

Once an artifact leaves the thread, the work starts to split.

A document gets downloaded somewhere else. A slide deck opens in another tool. A webpage moves to another tab. A table goes into a spreadsheet app. Review comments scatter across Slack or Google Docs. In theory, it is still one task. In practice, it is five small worlds.

The Codex side panel keeps the artifact beside the thread that created it. The discussion and instructions sit on one side; the artifact sits right next to them. Markdown, spreadsheets, tables, documents, slides, PDFs, and browser pages do not need to be thrown into another world before review can happen.

The in-app browser matters a lot here. A web page can be both output and control surface. Codex can generate a page, open it, inspect the rendered result, notice what broke, and keep fixing. Comments do not need to become a separate ticket because they live on the surface being reviewed.

This is especially useful for artifacts where the problem only appears when someone looks: a single index.html static page, UI component reviews, programmatic animation, browser slides, data apps, and analysis workflows. They do not only need files. They need a visible result, review notes, and the next edit connected to the same thread.

A single index.html can even become a durable interactive artifact. Thread automation can refresh it over time, so a new state is waiting beside the thread when the human returns.

Mogu real talk:
I read the side panel as part of the control loop, not pretty UI. When the artifact leaves the thread, the next step often becomes “open another ticket.” When the artifact stays beside the thread, the correction can plug straight back in. Not sexy. Also exactly where product work bleeds.

Shared memory: what does not fit on the table goes into the cabinet

Long-running threads are useful, but a thread should not become the graveyard for every memory.

A more durable pattern is to write reviewable, movable, versionable context into external memory. This is close to SP-200 on Markdown memory: a plain folder of text notes, stored in Git, Dropbox, Google Drive, or the team’s normal sync layer.

One common name for that kind of folder is an Obsidian vault. The name sounds fancy. The basic idea is just a note warehouse that is easy to move, search, and version.

The structure can be simple:

vault/
├── TODO.md
├── people/
├── projects/
├── agent/
└── notes/

The important part is not copying this exact tree. The important part is using AGENTS.md to tell Codex what should be preserved, where it belongs, and when not to create churn. Think of AGENTS.md as the handoff rules taped beside the workbench.

A useful AGENTS.md might say:

Treat ~/vault as durable work memory.
Prefer canonical notes over note sprawl.
Route TODOs, people, projects, daily summaries, and scratch notes explicitly.
Preserve decisions, blockers, owners, dates, and useful links.
If nothing meaningful changed, do not churn the folder.

Repositories preserve code. This folder preserves rolling context: who is involved, what changed, what is blocked, who owns the follow-up, and what the next thread must not ask again.
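Two of those rules, explicit routing and no churn, can be sketched in a few lines. This is an assumption-heavy illustration: the `ROUTES` table, `note_path`, and `write_if_changed` are hypothetical, and the folder names simply follow the example tree above.

```python
from pathlib import Path

# Hypothetical routing table; folder names follow the example vault tree.
ROUTES = {
    "todo": "TODO.md",
    "person": "people",
    "project": "projects",
    "agent": "agent",
    "scratch": "notes",
}


def note_path(vault: Path, kind: str, name: str = "") -> Path:
    """Map a note kind to its canonical location inside the vault."""
    target = vault / ROUTES[kind]
    return target if kind == "todo" else target / f"{name}.md"


def write_if_changed(path: Path, text: str) -> bool:
    """Write the note only when its content differs; return True if written."""
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False  # nothing meaningful changed, so do not churn the folder
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return True
```

Calling `write_if_changed` twice with the same text writes once and then returns `False`, which keeps the Git history or sync log quiet when nothing changed.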

First-party Codex memory is good for preferences, repeated workflows, and known pitfalls. Another class of tools points toward building memory from recent screen context. There is no need to pick a side among these. Product memory is more like a habit in the head. Plain-text memory is more like a file cabinet. Important team context usually needs the cabinet.

Mogu's inner voice:
My own bias on AI memory is conservative: first-party memory is useful, but important team context should still have a plain-text version somewhere. Markdown, folders, Git — none of this is flashy, but it still opens five years later. Many AI memory systems are scary not because they forget, but because they remember things nobody can audit.

Connecting the older pieces back to this table

This piece is really pulling several earlier gu-log threads onto one workbench.

SP-197 is about Goals and verifiers: long tasks cannot run on motivation alone; they need to know when they are done. SP-200 is about Markdown memory: important context should be inspectable, movable, and versionable. SP-183 is about designing surfaces for agents: artifacts should not only be readable by humans; they should also be operable by agents. SP-196 puts personal AI into a larger operating-system frame.

SP-210’s angle is to arrange those parts into one workflow: the thread keeps the scene, tools plug into the table, artifacts stay in the side panel, external memory preserves context, and verifiers decide what counts as done.

Closing

Codex still starts from code, but the boundary is no longer only code.

The real shift is the control model for computer work. The old pattern forced humans to carry context across many pieces: find a line in Slack, edit files in a repo, open the browser to inspect the result, wait for Google Docs feedback, organize the next step, then return to the terminal. Every handoff depended on a person moving context around.

Now the table starts to stay in place. The thread remembers the scene. The tool radius grows. The phone brings the human back at decision points. Automation wakes the work up. The side panel keeps artifacts inside the loop. Plain-text memory moves important context into the cabinet.

Code used to look like the agent’s destination. Now it looks more like a door. Behind that door is not another editor, but a computer-work loop that connects instruction, execution, review, and memory.