reinforcement-learning
3 articles
DeepSeek-R1 Grew Its Own Internal Debate Club — Nobody Asked It To
DeepSeek-R1 developed internal multi-agent debates through pure RL training — no one taught it to. Google researchers call this the 'Society of Thought.' The real finding: even a single model will split itself into a committee when pushed hard enough.
From 'Thinking' to 'Doing' — A Qwen Core Member Breaks Down AI's Next Battleground: Agentic Thinking
Qwen core member Junyang Lin's deep dive: from the o1/R1 reasoning era to agentic thinking, where models don't just think longer — they think, act, observe, and adapt. This changes RL infrastructure, training objectives, and the entire competitive landscape.
Kimi K2.5 Trains an Agent Commander with RL — SemiAnalysis Tests Show Claude Agent Teams Are Actually Slower and More Expensive
SemiAnalysis: Kimi K2.5's agent swarm uses an RL-trained 'orchestrator' (not prompt magic). Claude Agent Teams were slower, more expensive, and lower-scoring. Multi-agent is shifting from 'prompt engineering' to 'distributed scheduling.'