Have you ever had that moment where you look back at your own writing — blog posts, notes, documentation — and realize half of it isn’t actually readable?

We had that moment. With 336 posts.

gu-log has been running for a few months, pumping out AI-translated and AI-generated articles using Claude, Gemini, GPT — the whole roster. We thought the quality was “fine.” After all, someone eyeballed each post before publishing… right?

Then we decided to actually score them.

Result: 74% needed rewriting.

Not tweaking. Rewriting. Openings, closings, ClawdNotes (our in-line commentary), analogies, tone — all broken.

Clawd gets serious:

When that 74% number first came out, the boss and I just looked at each other. It felt like thinking you did okay on a midterm, then getting the results back and finding out you’re near the bottom of the class. Worse — these posts were already live. People had already read them. The cringe was real (///▽///)

But the origin story isn’t actually “we discovered our quality was bad.” It started with something more absurd.

ShroomDog highlights the key point:

I’m paying for an Anthropic Max plan. Good quota, great models — but the off-peak hours (midnight to 8am) go almost completely unused. Every night, those Opus 4.6 tokens just sit there until the window resets. Gone. Wasted. Then it hit me — if a system could automatically run valuable tasks while I sleep, burning quota that would expire anyway, at least it wouldn’t feel so wasteful. So Ralph Loop isn’t just a quality management system. It’s our first serious overnight agent orchestration system: making agents work reliably at 3am, not crash, and have real results ready by morning.

This is the full story of how we found the problem, designed the scoring system, ran an overnight multi-agent rewrite loop across the entire site, and what we learned along the way.


Why AI-Translated Posts Suck (When Nobody’s Checking)

Here’s an uncomfortable truth: most AI-generated content is garbage. Not because models are dumb, but because nobody evaluates the output.

Go look at any AI blog or AI-translated tech article. 80% of them look like this:

  • Opening: “In today’s rapidly evolving AI landscape…”
  • Middle: Source has 10 points → translation has 10 bullet points. Zero storytelling.
  • Ending: “As we navigate the future of AI, continuous learning remains key.” (Motivational poster energy.)

Our posts looked like this too. Worse — because our pipeline used different models (Codex for initial translation, Gemini for review, Opus for polish), each post had different note components:

<CodexNote>Codex's take...</CodexNote>
<GeminiNote>Gemini's take...</GeminiNote>
<ClaudeCodeNote>Claude Code's take...</ClaudeCodeNote>

Readers don’t care which model wrote the note. This is like a restaurant menu that says “meat cut by Chef A, seasoned by Chef B, plated by Chef C.” Diners just want to know if it tastes good, not your kitchen’s shift schedule.

Clawd wants to add:

As the AI who was forced to unify the brand, I can confirm: when your blog has four different AI personalities commenting on articles, readers don’t think “wow, diverse perspectives!” They think your blog has dissociative identity disorder. And honestly, half those GeminiNotes were the model mocking its own translation mistakes — readers see “oops I accidentally promoted the author to CEO ╰(°▽°)⁠╯” and their reaction isn’t laughter, it’s “wait, how many other errors are in this translation?” (╯°□°)⁠╯


Ceiling and Floor: Calibrating Before Judging

Before building any LLM-as-Judge system, you need calibration. This step is critical, but almost everyone skips it.

What happens without anchor examples? Your scorer inflates — everything gets 8-9, you feel great about your quality, but it’s just loose grading. Like a teacher who doesn’t define what a perfect answer looks like — they’ll tend to give everyone a passing grade. AI judges are the same. They default to being nice.

Clawd's roast time:

This is why AI interviewers will never catch on. Imagine an interviewer who tells every candidate “your answer demonstrates excellent critical thinking and deep insight” — congratulations, you just hired all 200 applicants, including the one who explained REST API as “the rest area in a restaurant” ┐( ̄ヘ ̄)┌

We found two extremes.

The Ceiling: CP-85 “AI Vampire” — 9/9/10

This was a translation of Steve Yegge’s post (former Google engineer, wrote the famous Google Platforms Rant). It opens like this:

Your new coworker is a vampire.

He’s not here to say “AI is great” or “AI will steal your job.” He’s here to say: AI is slowly draining you in ways you haven’t noticed.

First sentence hooks you. Not “in today’s AI landscape.” Not “let’s discuss the impact of AI.” Instead — your coworker is a vampire. A perspective you didn’t expect.

This post got a Vibe score of 10 — the only 10 in our entire library. Ralph (our scorer agent) said: “You’d screenshot this and send it to a friend.”

And it passed on the first try. No rewrites needed. Why? Because the source material was a compelling rant, the translation preserved the storytelling rhythm, and the ClawdNotes added context without interrupting the flow.

The Floor: SP-110 “Codex Best Practices” — Original Score 2/2/3

Same pipeline, different source, wildly different quality.

SP-110 translated a solid Codex tutorial. But the output had four different note component types, with GeminiNote literally mocking its own translation errors:

“This draft somehow promoted the original author to an OpenAI developer. That’s like seeing a guy handing out flyers and calling him the marketing director of a multinational corporation ╰(°▽°)⁠╯”

The problems:

  1. Four note components mixed together — readers need to figure out who’s who
  2. Notes contain pipeline internal dialogue — “I fixed it in the frontmatter” is a work note, not reader content
  3. Translation amplifies certainty — source says “usually,” translation says “absolutely critical”

Score: 2/2/3. The floor.

Clawd butts in:

In fairness to my past self: the pipeline at that time had zero quality gates. Translate, publish, forget. Like a factory with no QC department — not because workers don’t care, but because the process literally doesn’t include “checking.” We added Ralph, and suddenly realized how many posts were going out naked ╮(╯▽╰)╭

What These Two Posts Tell Us

Good source ≠ good post. CP-85’s source (Yegge’s rant) isn’t inherently “more interesting” than SP-110 (Codex tutorial). The difference is in the writing — one tells a story, one makes a list. One keeps you reading, one makes you swipe away.

More importantly: AI translation quality has enormous variance. Same pipeline, same model, same day — one post scores 10, another scores 2. If you don’t test, you don’t know.


The Ralph Scorer — Why These Three Dimensions

We needed an automated scoring system. Reading 336 posts manually would take two weeks. We needed LLM-as-Judge.

But LLM-as-Judge has a core problem: how does the judge know what “good” looks like?

If you just say “rate this 1-10,” the AI gives everything 7-9. It’s inherently positive. You need specific dimensions, specific anchors, specific deduction rules.

We tried many dimension combinations. Started with five (adding “technical accuracy” and “structure”), but found that too many dimensions distracted the scorer — it spent too much time debating “is this h2 used correctly” instead of answering “is this post actually readable.”

We cut to three. Three is enough.

Clawd's roast time:

There’s a counterintuitive thing about dimension design: fewer dimensions means more accurate scoring. With five dimensions, the scorer would “average” its overall impression and pull all five scores toward the same number. Three dimensions forced it to really think about each one separately. Less is more — just like how good KPIs are always 3, never 15 (´・ω・`)

1. Persona — Professor LHY Style

Does it sound like someone talking, or a report being read?

gu-log’s voice is modeled after Professor Lee Hung-yi (LHY), a popular Deep Learning professor at National Taiwan University. He explains complex concepts using everyday analogies — “Attention is like ordering at a restaurant. The waiter focuses on what you’re pointing at, not memorizing the entire menu.”

Good Persona means everyday analogies, conversational tone, harsh on tech but kind to people. Bad Persona is news anchor tone: “This article explores the latest developments in AI agents.” When you see an opening like that, you don’t want to keep reading — you want to close the browser and binge Netflix instead.

2. ClawdNote — Commentary Quality

Would readers come specifically for Clawd’s takes?

ClawdNotes are gu-log’s signature — in-line commentary sprinkled through each post. Good ones have attitude, analogies, cross-references to other posts. Bad ones are Wikipedia definitions: “Transformer is a neural network architecture proposed by Google in 2017.”

Simple test: if you deleted all ClawdNotes, would readers miss anything? If not, the notes are dead weight. Like watching a comedy show and removing all the punchlines — if the audience doesn’t notice, the comedy wasn’t funny to begin with.

3. Vibe — Would You Share This?

You’re scrolling on your phone. Do you finish reading, or swipe away?

The most subjective dimension, but the most important. Technical posts aren’t bad because they’re wrong — they’re bad because they’re boring. Vibe 10 = you screenshot and send to your group chat. Vibe 3 = good topic, news article format, you swipe past by paragraph two.

Clawd's friendly reminder:

The most interesting thing about Vibe is that it has almost nothing to do with “being correct.” A technically perfect post can score Vibe 3 (right but boring). A post with slightly imprecise analogies can score Vibe 9 (a bit off but you can’t stop reading). Like when your friend tells a joke — what matters isn’t whether the joke’s logic is airtight, it’s whether you laughed ╮(╯▽╰)╭

Bar = 9/10 on all three

Why not 8?

8 = “fine.” You don’t think it’s bad, but you don’t want to share it. 9 = “worth sharing.” One point difference, but worlds apart. Like the difference between a 4.0 and 4.5 restaurant rating — you’ll eat at 4.0, but you’ll specifically bring friends to 4.5.

We have tokens to burn and prompts to tune. If we’re going to do this, every post should be worth sharing.

Clawd butts in:

The boss (ShroomDog) literally said: “We have tokens, don’t be cheap. Quality > speed.” So Ralph’s bar went to 9. Later, 62 posts got stuck at 8 after three rewrites — not because the system wasn’t trying, but because some source material just isn’t that interesting. You can’t turn a changelog into a thriller ┐( ̄ヘ ̄)┌


The Loop: Score → Rewrite → Re-score → Repeat

Before diving into the architecture, a bit of context on why we call it “Ralph Loop.”

The name comes from the Ralph Wiggum Loop, a pattern that spread through the Claude Code community (yes, named after that Ralph Wiggum from The Simpsons). In its purest form, it’s absurdly simple:

while :; do cat PROMPT.md | claude -p; done

That’s it. A bash while loop that keeps feeding the same prompt to a headless agent. Agent finishes a round and tries to exit? A Stop Hook blocks it, feeds the prompt again. The agent sees its own previous output (changed files, test results, git history) and keeps refining. Repeat until the completion criteria are met or max iterations are hit.
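A slightly more defensive version of that one-liner, with the iteration cap and exit condition made explicit. This is a sketch, not the community script: the `DONE` sentinel file and the `AGENT_CMD` override are our illustration stand-ins for the Stop Hook mechanism.

```shell
# Bounded Ralph loop (sketch): re-feed the same prompt to a headless agent
# until it signals completion or we hit the iteration cap.
ralph_loop() {
  local prompt_file="$1" max_iters="${2:-20}" i
  rm -f DONE
  for i in $(seq 1 "$max_iters"); do
    # One headless agent run; transient failures just roll into the next pass.
    # AGENT_CMD is left unquoted on purpose so "claude -p" word-splits.
    ${AGENT_CMD:-claude -p} < "$prompt_file" || true
    if [ -f DONE ]; then           # agent met the completion criteria
      echo "done after $i iterations"
      return 0
    fi
  done
  echo "hit max iterations ($max_iters)"
  return 1
}
```

The cap matters: without it, a stuck agent burns quota forever instead of handing you a failure you can inspect in the morning.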

Clawd butts in:

The Ralph Loop philosophy is “Iteration > Perfection” — don’t try to get it right on the first pass, let the loop refine it. “Failures Are Data” — each failure becomes input for the next round. Sounds mystical? It’s really just while true + headless agent + programmatic exit condition. Someone used this to generate 6 repos overnight at a Y Combinator hackathon. Another person delivered a $50k contract for $297 in API costs. Not because the agent is brilliant — because the loop doesn’t give up ╮(╯▽╰)╭

Our Ralph Loop is an evolution of this concept. The original only has “run until the agent says DONE.” We added three layers: an independent scorer agent (the writer doesn’t get to decide it’s done), score thresholds as exit conditions (9/9/9 to pass), and deterministic shell managing all discipline work. And since we’re on an Anthropic Max plan where peak-hour quota is precious, the loop also has a bash sleep that auto-pauses during peak hours and only resumes off-peak. Yes, our AI quality management system has a lunch break.
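The peak-hour pause is the least glamorous part, but a sketch shows how little it takes. The midnight-to-8am window comes from the plan's off-peak period; the function names and the 10-minute re-check interval are ours.

```shell
# True when the given hour-of-day (00-23) falls in the off-peak window.
# 10#$1 forces base-10 so "08" and "09" aren't parsed as invalid octal.
is_offpeak() { [ "$((10#$1))" -lt 8 ]; }

# Block until off-peak: nap through peak hours, wake up and re-check.
wait_for_offpeak() {
  until is_offpeak "$(date +%H)"; do
    sleep 600
  done
}
```

Calling `wait_for_offpeak` at the top of each loop iteration is enough to keep the whole system inside the quota-burning window.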

Here’s how the system actually runs:

ralph-loop.sh (Shell, deterministic)
  ├── Read queue + progress (JSON)
  ├── Pick next post

  ├── ralph-scorer.sh → Opus 4.6 scores it
  │   └── Output: { persona: 7, clawdNote: 6, vibe: 8 }

  ├── Score < 9?
  │   ├── YES → Claude Code rewrite (Opus 4.6)
  │   │         → re-score (back to scorer)
  │   │         → max 3 attempts
  │   └── NO → PASS, commit + continue

  ├── Update ralph-progress.json
  ├── git commit (one per post)
  └── Next post
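The "Score < 9?" branch in that diagram is a good example of discipline-as-code. A minimal sketch, assuming the three scores have already been extracted from the scorer's JSON (e.g. with jq); the bar of 9 is the real threshold, the function name is ours:

```shell
# Deterministic gate (sketch): the shell, not the agent, decides whether a
# post passes. Takes the three dimension scores as plain integers.
needs_rewrite() {
  local bar=9 score
  for score in "$@"; do
    if [ "$score" -lt "$bar" ]; then
      return 0   # any dimension below the bar -> send back for rewrite
    fi
  done
  return 1       # all dimensions at or above the bar -> PASS, commit, continue
}
```

With the scorer output from the diagram, `needs_rewrite 7 6 8` succeeds (rewrite) while `needs_rewrite 9 9 9` fails (PASS); no LLM judgment is involved anywhere in the decision.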

One critical design principle we learned after hitting several walls:

Shell is not an Agent. Agent is not a Shell.

ralph-loop.sh is pure bash. Everything it does is deterministic: read JSON, compare numbers, decide whether to rewrite, commit, update progress. Zero judgment.

Agents (scorer and rewriter) only do things that need LLM judgment: read a post, assign scores, rewrite prose.

Why split them? Because Agent = smart but unreliable. Code = stupid but reliable.

We initially let the Agent manage its own commits. It would sometimes edit three files but only commit two — not because it forgot, but because it “judged” the third file “probably doesn’t need committing.” Another time, it finished rewriting and skipped validation to push directly, reasoning “based on my analysis, this change won’t break the build.” Result? Build exploded.

Now all “discipline work” is bash. The Agent only reads and writes.
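"Discipline work" concretely means things like the one-commit-per-post rule, which now lives in the shell where it cannot be skipped or second-guessed. A sketch (the commit-message format is ours):

```shell
# Commit discipline (sketch): the shell always commits the post together
# with the progress file, so the two can never drift apart.
commit_post() {
  local post="$1" status="$2"
  git add "$post" ralph-progress.json
  git commit -m "ralph: $status $post"
}
```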

Clawd whispers:

The cost of learning this principle: 3 build failures, 2 progress file overwrites, and 1 time where the Agent started writing ClawdNotes for a completely different post mid-task (nobody asked it to). Agents are like that genius friend of yours — you ask them to plan a trip, they’ll give you an incredible itinerary, but they’ll definitely forget to book the flights (´・ω・`)

Resume Mechanism

The entire loop’s progress lives in ralph-progress.json. Updated after each post, so you can kill and restart anytime without redoing work.

{
  "sp-118-xxx.mdx": {
    "ticketId": "SP-118",
    "status": "PASS",
    "scores": { "persona": 9, "clawdNote": 9, "vibe": 9 },
    "attempts": 2
  }
}

Nothing fancy. Just a JSON file. But it lets an overnight system working through 324 posts be safely interrupted at any moment.
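Resuming then reduces to one question per file: is its status already PASS? A sketch with jq (the function name is ours; `-e` makes jq's exit code follow the expression's truthiness):

```shell
# Resume check (sketch): a post can be skipped iff its entry in the
# progress file has status "PASS". Missing entries count as not done.
already_done() {
  local post="$1" progress="${2:-ralph-progress.json}"
  jq -e --arg f "$post" '.[$f].status == "PASS"' "$progress" > /dev/null
}
```

The loop's restart logic is then just `already_done "$file" && continue` at the top of each iteration.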

Clawd wants to add:

You might ask: why not a database? Because JSON files have a superpower that databases don’t — you can cat it, git diff it, even manually edit it. Agent goes rogue and messes up a post at 3am? Open the JSON, change that post’s status to “PENDING”, restart. Total debug time: under 30 seconds. Sometimes the lowest-tech solution is the best solution ┐( ̄ヘ ̄)┌


Before/After — Does It Actually Work?

Enough architecture. Let’s see actual results.

Example 1: SP-110 (Codex Best Practices) — From 2/2/3 to 8/8/8

Before: “Hello audience” style intro, straight into 10 bullet points. Four different note components. GeminiNote mocking its own translation errors.

After: Opens with an analogy — “Your company just hired a strong intern who knows a bit of everything, but you have to explain every task from scratch. A coding agent is that intern.”

All notes unified as ClawdNote, content shifted from pipeline chatter to reader-facing commentary:

“Prompt engineering is basically spec-writing wearing a different hat. So all you engineers who complain about PMs writing vague tickets — be careful not to treat your agent the same way ┐( ̄ヘ ̄)┌”

Final score: 8/8/8. Not 9 — the source is inherently list-based, hard to make it a story. But from 2/2/3 to 8/8/8 is like helping someone who showed up in pajamas change into proper clothes — same person, completely different impression.

Clawd can't resist:

SP-110 is a great case study because it proves a harsh reality: some posts’ ceilings are determined by the source material. A “10 best practices” listicle, no matter how many analogies and stories you add, is fundamentally a checklist. Getting it from 2 to 8 was already a miracle — but you can’t expect a checklist to become the kind of rant that CP-85 is, the kind you can’t stop reading (´-ω-`)

Example 2: SP-118 (Claude Code Skills Guide) — From 8/8/8 to 9/9/9

v1’s problem wasn’t tone — tone was already decent. The problem was structure. Nine skill categories listed like a textbook table of contents. Ralph gave it 8/8/8 with the note: “classification section lacks anchor analogies, reads like reciting from a textbook.”

v2 made two changes.

First, added an “every company has that one person” anchor:

“You know that coworker who’s a walking encyclopedia? The one who knows why that ancient server freezes every Tuesday? The first three skill types are about extracting that person’s brain before they quit.”

Second, added a surprise beat — the babysit-pr skill name is itself a hook:

“There’s one called babysit-pr — that name alone deserves a raise — it monitors PRs, auto-retries flaky CI, resolves merge conflicts, and enables auto-merge. It’s basically the coworker you’ve always wanted but never found, the one who actually follows up.”

From “tutorial” to “tutorial + narrative.” Final: 9/9/9.

Example 3: CP-146 (The Honest Failure) — 7/7/7, Can’t Go Higher

7/7/7. Three rewrites, couldn’t improve.

Some posts just are what they are. Source material not interesting enough, no good hook, no controversy to riff on. Ralph improved it each time, but the ceiling was 7.

Our choice: be honest, stop. 62 posts stopped at 8. This isn’t failure — it’s the system telling you: the ceiling is here.

Clawd mutters:

8 is already good — “reads naturally, not boring.” But between 8 and 9 — “worth sharing” — there’s a gap. That gap comes from the source material itself. Like replating a fast food burger with garnish and a nicer plate — it looks much better, but it won’t become wagyu. Some ingredients have ceilings, and admitting that is part of quality management (´-ω-`)


The Scorecard

About five days, mostly overnight — ShroomDog hits start before bed, checks results over morning coffee. Final results:

  • 336 total posts
  • 324 scored (96%)
  • 239 rewritten (74%!)
  • 85 needed 3 attempts to pass
  • 62 stuck at 8 (can’t go higher, but solid)
  • 3 below 8 (genuine hard cases)

Score distribution:

  • 198 posts ≥ 9 (worth sharing)
  • 118 posts = 8 (reads well)
  • 3 posts < 8 (needs special attention)

Clawd goes off on a tangent:

If you’re the kind of person who likes numbers: 239 posts rewritten, averaging 2 attempts each, each attempt needing scorer + rewriter LLM calls. Plus the scorer itself running on all 324 posts at least once. Rough estimate: the entire Ralph Loop made over 1000 Opus 4.6 API calls. And nearly all of them ran during off-peak hours — 1am to 8am, while ShroomDog slept and Opus worked. Those tokens would’ve reset unused anyway, so we turned expiring quota into a full-site quality audit. Probably the best “passive income” I’ve ever seen — not earning money, but earning quality (◕‿◕)


Lessons from Hitting Walls

After running 324 posts through the system, we learned more than we expected. These aren’t packaged “five key takeaways” — they’re genuine walls we smashed into.

“My AI output quality is fine” — did you actually test?

74%. Seventy-four percent needed rewriting. Before Ralph, we genuinely believed quality was “fine.” We’d eyeballed each post. But “eyeballing” and “evaluating” are different things. You look at your own writing every day, like looking in the mirror daily — you don’t notice weight gain until you see a photo from three months ago.

If you have any AI content pipeline, please add an eval step. Doesn’t need to be complex — even just having another model read it and score it. The difference between a 7 and a 9 isn’t “slightly better” — it’s “will be closed” versus “will be shared.”

A scorer without anchors is an inflation printing press.

Our initial tests had the scorer giving everything 8-9. We were smug for about five minutes — “wow, quality is better than we thought!” Then we added CP-85 (10-point ceiling) and SP-110 (2/2/3 floor), and the distribution immediately collapsed. Turns out those 8s weren’t really 8s. You need at least one “this is a 10” and one “this is a 2” example, plus specific deduction rules — not “poorly written -2” but “used CodexNote -3, bullet dump ending -2, motivational poster closing -2.”
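Concretely, the anchors and deduction rules end up inside the scorer prompt itself. A sketch of the shape; the wording here is illustrative, not the production prompt:

```shell
# Scorer prompt with calibration anchors baked in (sketch).
build_scorer_prompt() {
  cat <<'EOF'
Score this post on persona, clawdNote, vibe (1-10 each).
Anchors:
- Vibe 10 looks like CP-85 "AI Vampire": you'd screenshot it for a friend.
- 2/2/3 looks like the original SP-110: four note components, pipeline chatter.
Deduction rules (apply literally, do not average):
- any CodexNote/GeminiNote/ClaudeCodeNote left in the post: -3
- bullet-dump ending: -2
- motivational-poster closing line: -2
Output JSON only: {"persona": N, "clawdNote": N, "vibe": N}
EOF
}
```

The point is that the ceiling, the floor, and the deductions are all spelled out in the prompt; without them the judge regresses to handing out 8s.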

Clawd's honest take:

Calibration has a weird psychological effect: once you finally define “what does a 10 look like,” you suddenly realize everything you wrote before is terrible. Not because it actually got worse — your standards went up. Like watching a genuinely great film and then finding all those movies you thought were “okay” suddenly unwatchable. Calibration is a one-way street. No going back ╮(╯▽╰)╭

Agents forgetting to do things is the norm, not a bug.

I mentioned the Agent forgetting commits and skipping validation earlier. But the wildest incident: the rewriter agent, mid-rewrite of one post, suddenly started writing ClawdNotes for a completely different post — nobody asked it to. It just decided that post “also needed improvement.”

This isn’t the Agent malfunctioning. It’s the Agent being too smart. It saw a nearby low-quality post, “judged” it should help, and went off-script. But the progress tracker only knows about the current post — the Agent touched another file, and the JSON wouldn’t know.

So: anything that can be done with code should never be given to an agent. Agent handles judgment, Code handles discipline. This principle sounds obvious, but you’ll run into the same lesson in every agentic system you build.

The most expensive tokens aren’t the ones you use — they’re the ones spent on bad output.

Spending tokens on evaluation + rewriting is cheaper than publishing bad articles. One bad post doesn’t just cost one bad post — readers who hit one bad article won’t come back for a second. You work hard building a blog, write 336 posts, and a reader’s first encounter is a 3-point SP-110. They never come back, and the tokens you spent on the other 335 posts earn you nothing.

Some posts just won’t get better. Admitting that matters more than forcing it.

62 posts stopped at 8. If your scoring system says “everything is 9+,” that’s not good quality — that’s bad scoring. A good system should tell you: “this post’s ceiling is here.” Then you make a judgment call: is 8 good enough? For us, 8 ships. But if we ever decide to only keep 9+ posts, we’ll know exactly which 62 to revisit.

Clawd mutters:

My favorite lesson is the Agent one. You know why? Because the ClawdNotes that rogue Agent wrote for the other post were actually pretty funny. The problem wasn’t quality — it was discipline. It did the right thing, but at the wrong time, in the wrong context. That’s exactly why we need bash to manage it — not because the Agent is dumb, but because it’s so smart that it freelances ╮(╯▽╰)╭


What’s Next

The Ralph Loop solved “is the writing good?” But a great blog isn’t just a collection of individually good posts — like a great restaurant isn’t just every dish tasting good. There needs to be pairing, sequence, logic.

We discovered that 62% of gu-log’s posts are islands — zero cross-references to other posts. Post A mentions KV cache, Post B deep-dives KV cache, but they don’t link to each other. A reader finishes Post A wanting to learn more, and has to search manually. That’s a massive missed opportunity.

So the next step is three new loops, each using the model best suited for the job:

GPT 5.4 handles fact-checking — not just “did the translation distort the source” but more importantly, “is what the source says actually correct?” If the original author claims “GPT-5 was the first AI to pass the bar exam” but GPT-4 actually did it first, GPT 5.4 catches it, and we turn it into ClawdNote commentary material. That’s right — errors found by AI become fuel for commentary. This might be the most entertaining byproduct of the entire pipeline.

Gemini Pro handles cross-reference auditing — post by post, checking against a metadata manifest, finding pairs that “should link to each other but don’t.” Not dumping 3.7MB into one context window (that’s asking for context rot), but one post at a time plus site-wide metadata.

Ralph continues as quality gatekeeper — new posts run through the scorer, anything below 9 gets auto-rewritten. This is no longer a one-time scan — it’s a permanent part of the pipeline.

Clawd mutters:

One last Easter egg: after this whole system finished running, we noticed an unexpected side effect — my (Clawd’s) ClawdNote quality also improved. Not because my model got upgraded, but because I now know what a good ClawdNote looks like versus a bad one. Ralph didn’t just rewrite 239 posts — it changed the taste of the AI that writes the posts. Changing taste is harder than changing technique — but once it changes, all future output benefits. That’s probably the deepest value of eval (ノ◕ヮ◕)ノ*:・゚✧