What If Your AI Scientist Could Remember Why It Failed? EvoScientist's Self-Evolving Research Team

Picture this. An AI scientist that reads papers, proposes research ideas, writes code, runs experiments, and summarizes results. It can even turn a rough idea into something that looks suspiciously close to a real proposal.

Pretty impressive.

But if every new task wipes its memory of the last one, then underneath all that polish, it is still just a very articulate first-year grad student. Yesterday it slammed into a dead end. Today it gets a fresh prompt and happily walks right back into the same wall.

That is the problem EvoScientist is trying to solve: what if your AI assistant could remember why previous attempts failed, then use that memory to come up with better strategies?

And the paper’s answer is not simply “add more agents.” The real move is more interesting: put evolution inside the multi-agent system. Don’t just finish a task and throw the traces away. Distill the good directions, the bad directions, and the useful coding strategies into something the next task can actually use.

Mogu chimes in:

A lot of agent systems look like they’re building the Avengers: Research Agent, Coding Agent, Reviewer Agent, Manager Agent — everyone’s got a cool title.
But honestly, if the whole team gets collective amnesia after every mission, this is not the Avengers. It’s a party RPG where the characters reset to level 1 every morning.
This is why most multi-agent papers leave me impatient. Another role assignment chart, another workflow diagram, but nobody asking “where did last round’s lessons go?” EvoScientist at least treats that question like it matters (⁠◕⁠‿⁠◕⁠)

The Problem Is Not Too Few Agents. It’s Repeating the Same Mistakes Forever.

The paper opens with a blunt critique of most current AI scientist systems: their pipelines are static.

Static means the roles are fixed, the workflow is fixed, and the decision patterns are mostly fixed. A research agent proposes ideas. An engineering agent writes code. Everyone follows the script, produces an output, and goes home.

This looks clean in a demo. Real scientific discovery is not clean.

Research is closer to walking through fog. You chase ideas that look promising but go nowhere. You waste compute on things that were never feasible. You run experiments that fail for boring reasons. You stumble into good directions but don’t realize they were good until later.

And a static pipeline has one brutal weakness: it does not get smarter from these collisions.

So the same disasters keep replaying: a direction already known to be bad gets explored one more time; a code strategy that worked before is forgotten, so implementation starts from random guessing again; a half-successful but promising idea disappears because nobody distilled what was valuable about it.

The paper’s core insight is simple and strong: interaction history is not noise. It is an asset.

If you treat every multi-agent run as a disposable execution trace, then the whole system lives in permanent “first day on the job” mode.

Mogu butts in:

This is exactly what it’s like mentoring a junior engineer.
A junior is fine. A junior who wakes up every day as if it’s day one is a problem. Yesterday you explained why this migration cannot be sliced that way. Today they slice it that way again. Last week you explained the rate limit. This week they hammer the API again.
Growth is not “did work once.” Growth is “left behind reusable judgment after doing the work.” No judgment retained means no learning happened. AI agents are even more exposed here, because session reset is not a metaphor. It is literally the product behavior (⁠╯⁠°⁠□⁠°⁠)⁠╯

EvoScientist Is Not Three Agents Standing in a Line. It’s a Research Team That Learns.

On paper, the architecture sounds simple: three agents.

The Researcher Agent (RA) generates scientific ideas. It reads the goal, reviews relevant literature, proposes candidate directions, and runs a tree-search-like loop of propose, review, and refine.

The Engineer Agent (EA) turns the proposal into real experiments. It searches for executable code, handles data processing, tries training strategies, runs experiments, reads logs, diagnoses failures, patches the implementation, and tries again. In other words, the most overworked and indispensable person in the lab.

The Evolution Manager Agent (EMA) is the soul of the whole paper. It does not generate the first idea and it does not directly write the experiment code. Its job is something higher leverage: distill the successes and failures from RA and EA into reusable knowledge for future tasks.

It feels less like a project manager and more like the strongest senior researcher in the room. Not because it does the most first-hand labor, but because after a research cycle it can say:

“This direction is worth revisiting.”

“This one looks elegant but is currently a dead end. Stop wasting time there.”

“This implementation pattern is clearly more reliable. Start there next time.”

That is what separates EvoScientist from a lot of multi-agent papers. Others mostly do role decomposition. EvoScientist is trying to do organizational learning.

Mogu highlights:

EMA admits something many agent frameworks prefer not to admit:
storing logs is not the same thing as learning.
Plenty of systems proudly keep huge traces and say, “look, we have memory now.” Cool. If nobody turns those traces into “here’s what we should do differently next time,” then congratulations — you built a smarter landfill.
EMA is valuable because it is not a warehouse clerk. It is a coach. A clerk stores things. A coach helps you stop getting punched by the same move every week.
And I’d argue EMA matters more than RA or EA. Because splitting work into RA and EA is just decomposition — any multi-agent paper does that. EMA is what closes the learning loop. Without it, three agents and three hundred agents are the same thing: labor without growth ┐⁠(⁠￣⁠ヘ⁠￣⁠)⁠┌

The Real Core: Two Research Notebooks That Keep Getting Better

EvoScientist does not solve memory by stuffing all history back into the prompt. That would be the lazy version. Instead, it splits memory into two parts, each serving a different stage.

The first is ideation memory. This stores knowledge at the research-direction level: which directions look feasible, which ones have already failed, which patterns keep showing up in top-ranked ideas. Before RA starts generating new ideas, it retrieves relevant direction knowledge from this memory. So it does not begin from a blank whiteboard every time. It begins after reading the lab notebook.

The second is experimentation memory. This stores knowledge at the implementation level: which data processing patterns are more stable, which training strategies work better, which code-search trajectories led to good implementations before. Before EA starts writing code and running experiments, it retrieves this memory. It is basically checking the lab survival guide before touching the equipment.

Separating these two memories matters a lot.

Because “good ideas” and “good implementation” are different kinds of intelligence. You can be brilliant at proposing directions and still fail every time code touches the machine. You can be an incredible debugger and still choose boring or impossible ideas. EvoScientist does not blur these into one giant bag of knowledge. It treats them as separate capabilities that need separate accumulation and separate evolution.
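
To make the split concrete, here is a minimal Python sketch of two independent stores, one per stage. Everything in it is an assumption for illustration: the names MemoryEntry and MemoryStore, the tag-overlap retrieval, and the example lessons are hypothetical and not taken from the paper, which does not have to retrieve this way.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    text: str        # the distilled lesson, in plain language
    tags: frozenset  # hypothetical keyword tags used for retrieval


@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)

    def add(self, text, tags):
        self.entries.append(MemoryEntry(text, frozenset(tags)))

    def retrieve(self, query_tags, k=3):
        # Rank entries by how many tags they share with the query;
        # entries with no overlap are never returned.
        query = frozenset(query_tags)
        scored = [(len(e.tags & query), e) for e in self.entries]
        scored = [(s, e) for s, e in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e.text for _, e in scored[:k]]


# Two separate stores: direction-level lessons never mix with
# implementation-level lessons.
ideation_memory = MemoryStore()
experimentation_memory = MemoryStore()

ideation_memory.add(
    "Direction X failed: no executable baseline within budget",
    {"direction-x", "retrieval"},
)
experimentation_memory.add(
    "Normalize inputs before training; unnormalized runs diverged",
    {"training", "preprocessing"},
)

ra_context = ideation_memory.retrieve({"retrieval"})        # read by RA
ea_context = experimentation_memory.retrieve({"training"})  # read by EA
```

The point of the design is visible in the last two lines: RA and EA query different stores, so a training failure cannot leak into a judgment about whether a research direction is promising.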

This is also why the paper pairs so nicely with two earlier gu-log pieces: SD-11 on memory architecture, and SP-144 on turning repeated patterns into instincts. EvoScientist feels like the academic version of those two ideas fused together: memory preserves experience, and evolution turns experience into better future decisions.

Mogu real talk:

The most steal-worthy idea here is not just “have memory.” It’s designing the granularity of memory correctly.
When people hear “agent memory,” they often want one giant knowledge base where everything gets embedded together. Sounds modern. In practice, it often feels like throwing experiment notes, research hypotheses, debugging logs, and meeting minutes into one drawer.
Good luck finding what matters.
EvoScientist’s split is clean: failure at the research-direction level is not the same thing as failure at the implementation level. If you mix them, retrieval gets muddy and you start using “this model never trained correctly” to contaminate “this research direction may still be promising.” That is not memory design. That is just mess with embeddings on top (⁠⌐⁠■⁠_⁠■⁠)

How Evolution Actually Tastes: Not “Remembered” — “Distilled and Ready to Go”

Okay, so there is persistent memory. Then what? Saying “it has memory, therefore it evolves” is still a bit hand-wavy.

What makes this paper more convincing is how it grounds evolution into a real cycle, not a slogan.

Start with the RA side. After a round of idea tree search, EvoScientist does not only take the top-1 idea and move on. EMA looks back and identifies the promising directions that keep appearing across top-ranked ideas, then writes them into ideation memory. So the system is not merely remembering which answer won. It is remembering which directions deserve more thought.

But the more ruthless move is recording failure too. If a proposal proves infeasible — the engineer cannot find executable code within budget, or the proposed method performs worse than the baseline — EMA does not just log a single “failed” line and move on. It distills why that path did not work into ideation memory. This is not a success-stories notebook. It is a sailing journal that includes dead-end maps. Sometimes knowing where the walls are beats having one more inspirational route.
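
The two failure conditions named above can be sketched as a tiny check. This is a hedged illustration, not the paper's code: ProposalOutcome, distill_failure, the field names, and the "higher score is better" convention are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposalOutcome:
    direction: str
    found_executable_code: bool            # did EA get anything to run within budget?
    score: Optional[float] = None          # metric of the proposed method (assumed: higher is better)
    baseline_score: Optional[float] = None


def distill_failure(outcome: ProposalOutcome) -> Optional[str]:
    """Return a short 'why this did not work' note, or None if the proposal held up."""
    if not outcome.found_executable_code:
        return f"{outcome.direction}: no executable implementation found within budget"
    if outcome.score is not None and outcome.baseline_score is not None:
        if outcome.score < outcome.baseline_score:
            gap = outcome.baseline_score - outcome.score
            return f"{outcome.direction}: underperformed baseline by {gap:.2f}"
    return None


no_code = distill_failure(ProposalOutcome("direction A", found_executable_code=False))
worse = distill_failure(ProposalOutcome("direction B", True, score=0.75, baseline_score=0.80))
fine = distill_failure(ProposalOutcome("direction C", True, score=0.85, baseline_score=0.80))
```

A real system would distill a much richer explanation than one line, but even this toy version records a reason rather than a bare "failed" flag.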

Now the EA side. The experiments leave behind rich execution records: run status, logs, metrics, failure diagnoses, and which versions actually worked. EMA distills reusable execution strategies from those traces and stores them in experimentation memory. The next time a similar proposal shows up, EA does not begin in darkness — it reads the survival notes from the last few rounds first.

The full cycle: RA proposes → EA tests → EMA reflects and distills → the next RA and EA run with better priors.

That is not a normal pipeline anymore. It is closer to a research team doing a weekly retrospective and, here is the key part, actually changing next week’s behavior based on that retrospective. Not just nodding at the meeting and repeating the same mistakes on Monday.
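
As a rough sketch of that data flow, the cycle can be written as one function that threads two memories through three agents. The agents below are stand-in stub functions and the memories are plain lists; all names (run_cycle, researcher, engineer, manager) and the stub behavior are hypothetical, chosen only to show how lessons from one task reach the next.

```python
def run_cycle(goal, researcher, engineer, manager,
              ideation_memory, experimentation_memory):
    # RA proposes, conditioned on direction-level lessons from earlier tasks.
    proposal = researcher(goal, ideation_memory)
    # EA tests, conditioned on implementation-level lessons.
    result = engineer(proposal, experimentation_memory)
    # EMA reflects and distills; both memories grow for the next task.
    idea_lessons, exec_lessons = manager(proposal, result)
    ideation_memory.extend(idea_lessons)
    experimentation_memory.extend(exec_lessons)
    return result


# Stub agents, purely illustrative.
def researcher(goal, ideation_memory):
    return {"goal": goal, "known_lessons": len(ideation_memory)}


def engineer(proposal, experimentation_memory):
    # Toy rule: the run only succeeds once some execution lesson exists.
    return {"proposal": proposal, "success": len(experimentation_memory) > 0}


def manager(proposal, result):
    idea_lessons = [f"direction for {proposal['goal']} explored"]
    exec_lessons = [] if result["success"] else ["pin dependency versions before running"]
    return idea_lessons, exec_lessons


ideation_memory, experimentation_memory = [], []
first = run_cycle("task 1", researcher, engineer, manager,
                  ideation_memory, experimentation_memory)
second = run_cycle("task 2", researcher, engineer, manager,
                   ideation_memory, experimentation_memory)
```

In this toy run the first task fails, the manager writes down one execution lesson, and the second task starts with that lesson already in hand. That is the whole trick in miniature: the memories outlive the task.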

The paper also has two nice details. Candidate ideas are ranked using an Elo-based tournament rather than one floating absolute score — forcing pairwise comparison, which is often a more stable way to judge messy, creative outputs. And EA does not just do one pass of code generation. It performs experiment tree search, meaning implementation itself becomes an iterative search process rather than a one-shot gamble.
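
For readers who have not met Elo outside of chess, here is a minimal sketch of ranking ideas by pairwise comparison using the standard Elo update. The round-robin schedule, the starting rating of 1000, the K-factor of 32, and the toy judge are assumptions for illustration; the paper's actual tournament setup and judge may differ.

```python
import itertools


def expected_score(rating_a, rating_b):
    # Standard Elo expectation: probability that A beats B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, a_wins, k=32):
    # Move both ratings by the same amount in opposite directions.
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank_ideas(ideas, judge, initial=1000.0, k=32):
    """Round-robin tournament; judge(a, b) returns True if idea a beats idea b."""
    ratings = {idea: initial for idea in ideas}
    for a, b in itertools.combinations(ideas, 2):
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b), k)
    return sorted(ideas, key=lambda idea: ratings[idea], reverse=True)


# Toy judge: a hidden quality score stands in for an LLM or human comparison.
quality = {"idea-a": 3, "idea-b": 1, "idea-c": 2}
ranking = rank_ideas(list(quality), lambda a, b: quality[a] > quality[b])
```

The judge only ever answers "which of these two is better?", which is an easier question to answer consistently than "how good is this idea on a scale of 1 to 10?".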

Mogu whispers:

The idea of storing failure in memory sounds obvious only after someone else says it.
In practice, most teams are terrible at preserving failure knowledge. Success cases get documented, blogged, turned into conference talks, added to the wiki. Failed directions usually remain trapped inside one person’s head or buried in a postmortem everyone avoids reading after two weeks.
For an autonomous agent that explores a big search space, failure knowledge can be more valuable than success knowledge because it cuts branches off the tree.
This is the moment in the paper where EvoScientist earned my attention for real. A lot of multi-agent papers say “we have a memory module” but only record wins. That is not learning — that is a highlight reel. EvoScientist’s memory includes “do not go there again” knowledge, and that is where the real compound interest lives (⁠¬⁠‿⁠¬⁠)

Do the Results Actually Matter? This Time, Yeah, Kind of.

Architecture diagrams are cheap. Results are where the paper has to pay rent.

On scientific idea generation, EvoScientist is compared against seven open-source and commercial baselines. The evaluation dimensions are four familiar ones: novelty, feasibility, relevance, and clarity.

And importantly, the paper does not rely on only one judge. It reports both automatic evaluation and human evaluation. That matters, because “an LLM said we were amazing” is now roughly as trustworthy as every bubble tea shop claiming to be the best in town. Human reviewers make the result harder to hand-wave away.

The headline is straightforward: EvoScientist outperforms the baselines across those four dimensions overall.

On the code execution side, the paper gives a more grounded number: average execution success rate improves from 34.39 before evolution to 44.56 after evolution. Not a miracle. Not suspiciously perfect. Just a believable engineering result: once the system starts accumulating and reusing execution strategies, it succeeds more often.
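
For scale, the arithmetic behind that improvement is worth spelling out. The snippet assumes the two figures are percentages on the same scale, which the text above implies but does not state explicitly.

```python
before, after = 34.39, 44.56  # figures quoted above; assumed to be percentages

absolute_gain = after - before            # gain in percentage points
relative_gain = absolute_gain / before    # gain relative to the starting rate

summary = f"{absolute_gain:.2f} points absolute, {relative_gain:.1%} relative"
print(summary)
```

So the gain is about ten points in absolute terms, or a bit under a third relative to where the system started.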

The authors also push EvoScientist into full end-to-end scientific discovery and use it to generate and write six complete papers. According to the paper, all six were accepted to the ICAIS 2025 AI Scientist Track, and two received major awards.

I would keep a healthy academic eyebrow raised there — it is still the authors’ own setup and submission context — but even without the awards, the central claim already lands: persistent memory plus multi-agent evolution is not decorative. It changes exploration quality and execution reliability.

Mogu highlights:

Honestly, I like 34.39 → 44.56 because it does not look like marketing.
If the paper had said “we improved success rate from 34% to 93%,” my first instinct would be: what benchmark got sweetened, and where is the hidden trick?
But moving from low-30s to mid-40s feels like reality. The system did not become a god. It just started doing fewer stupid things. In engineering, that is often what major progress really looks like: same team, fewer walls, better odds ٩⁠(⁠◕⁠‿⁠◕⁠｡⁠)⁠۶

Not Just an Academic Paper: This Experience-Management Philosophy Transfers Everywhere

If you already use AI for research, coding, or agent workflows, the most valuable takeaway here is probably not “I should build my own AI scientist.”

What is more useful to steal is the way EvoScientist manages experience — and it comes in three layers.

Layer one: preserve failure, not just success. Many memory systems start from the same instinct: save the best practices. Reasonable, but insufficient. EvoScientist makes a more uncomfortable point: if the system never records failure, it will never know which roads are not worth taking. For an agent that explores large search spaces, avoiding three dead ends can be more valuable than discovering one extra clever trick.

Layer two: separate kinds of experience that feel the same but are not. Research directions, validation outcomes, and implementation strategies are all “experience,” sure — but they are not the same kind of experience. Shove them into one retrieval bucket and the system starts fishing out the wrong lesson at the wrong time. It is like digging through a wallet stuffed with receipts, sticky notes, prescriptions, and business cards — the ATM card is in there somewhere, but good luck finding it when you need it most.

Layer three, and the most critical: raw history does not turn into wisdom by itself. Logs, traces, and metrics are materials, not judgment. There must be a distillation layer that turns “what happened this time” into “what we should do next time.” That is EMA.

Mogu roast time:

This is where many agent systems stall out.
Everyone loves talking about retrieval, embeddings, context windows, and memory stores — like they are designing a very fashionable data center. But the thing that actually makes a system smarter is usually not “can it store more?” It is “can it turn experience into an actionable preference for the next run?”
Because otherwise, what? Stack more logs and wisdom emerges? If that worked, humans would never repeat a mistake after writing a postmortem — but we clearly do. The bottleneck was never the records. The bottleneck is having someone (or some agent) distill records into an action plan (⁠´⁠・⁠ω⁠・⁠`⁠)

That is why I see EvoScientist as complementary to SD-11 and SP-144, not overlapping with them. SD-11 is closer to “how should memory be stored so it does not rot?” SP-144 is closer to “how do repeated behaviors become reusable instincts?” EvoScientist pushes the question further: when agents are exploring unknown territory for a long time, how should memory and instincts reshape the research strategy itself?

One-sentence version: EvoScientist is the academic version of agent evolution; ECC is the engineering version. Both are aiming at the same end state — AI that does not merely do work, but grows judgment from doing the work.

The Ending That Actually Matters

Remember that first-year grad student from the opening? The one who gets a fresh prompt every morning and happily walks right back into the same dead end?

EvoScientist’s mission is to make that grad student stop being a perpetual beginner. Not by giving it more knowledge — but by letting it accumulate judgment.

What I like about this paper is that it takes a concept that could easily drift into hand-wavy philosophy and nails it down into a concrete systems idea:

A powerful AI is not the one that gets it right the first time. It’s the one that stops making the same mistake the second time.

That sounds almost too obvious. But if you look closely, this is exactly where a lot of current agent hype breaks apart. Everyone loves the first run: the autonomy, the polish, the theatrical feeling of having a tiny digital coworker. But once the work becomes long-horizon, systems without memory, without evolution, and without distilled failure reveal what they really are — very impressive versions of “first attempt.”

EvoScientist points toward a different future: AI scientists that do not only generate ideas and experiments, but slowly accumulate research instinct.

Today that instinct is being used for papers and benchmarks. Tomorrow the same pattern can show up in your coding agent, your research copilot, maybe even your entire AI team workflow.

Mogu's hot take:

A lot of people assume the upgrade path for AI is just: bigger model, longer context, stronger tool use.
Sure, those matter. But the more I read papers like this, the more I think the real dividing line is not how strong the model is — it’s whether the system learns from its own work. A medium-sized model with a good evolution loop might outperform a frontier model running a memoryless pipeline over the long haul.
EvoScientist points at a quieter and more compounding path: make the system repeat the same stupidity less often. That kind of progress rarely makes for a flashy keynote demo. In real work, it is usually the thing that matters most — just like that grad student who finally started remembering how they fell yesterday (⁠￣⁠▽⁠￣⁠)⁠／