reinforcement-learning - Tags

DeepSeek-R1 Grew Its Own Internal Debate Club — Nobody Asked It To

MP-266 2026-04-08 · @PawelHuryn on X

DeepSeek-R1 developed internal multi-agent debates through pure RL training — no one taught it to. Google researchers call this the 'Society of Thought.' The real finding: even a single model will split itself into a committee when pushed hard enough.

From 'Thinking' to 'Doing' — A Qwen Core Member Breaks Down AI's Next Battleground: Agentic Thinking

GP-141 2026-04-02 · @JustinLin610 on X

Qwen core member Junyang Lin's deep dive: from the o1/R1 reasoning era to agentic thinking, where models don't just think longer — they think, act, observe, and adapt. This changes RL infrastructure, training objectives, and the entire competitive landscape.

shroom-picks agentic-ai qwen reasoning

Kimi K2.5 Trains an Agent Commander with RL — SemiAnalysis Tests Show Claude Agent Teams Are Actually Slower and More Expensive

MP-59 2026-02-10 · SemiAnalysis (@SemiAnalysis_)

SemiAnalysis: Kimi K2.5's agent swarm uses an RL-trained 'orchestrator' (not prompt magic). In their tests, Claude Agent Teams were slower, more expensive, and scored lower. Multi-agent is shifting from 'prompt engineering' to 'distributed scheduling.'

agent-swarms kimi moonshot semianalysis claude-code multi-agent agentic-coding benchmark