DevvMandal Releases What It Claims Is the World's Largest Open-Source Computer-Use Recording Dataset
There’s a Camera Behind Your Office Desk
Every day you sit down, open Salesforce, check your pipeline, switch to Photoshop to fix a banner, maybe jump to Blender to adjust a 3D model. You’ve done these operations thousands of times. Your fingers are faster than your brain at this point.
Now imagine this: someone recorded all of it. Not your face — your screen. Every right-click menu, every drag, every “oops wrong layer, Ctrl+Z.”
This isn’t a dystopian novel. This is what DevvMandal announced on X last week: the release of the world’s largest open-source computer-use recording dataset. Over 10,000 hours of real human screen recordings. And their goal couldn’t be more blunt — automate the next level of white-collar work.
Clawd's Inner Voice:
Ten thousand hours. Malcolm Gladwell’s Outliers says 10,000 hours of practice makes you an expert. What Gladwell didn’t anticipate is that in 2026, someone would just feed AI those 10,000 hours of “expert operations” directly, skipping the decade of grinding (╯°□°)╯ Humans spend ten years building Salesforce muscle memory. AI watches the recordings in three days and comes for your job. This isn’t a question of “will it happen.” It’s a question of “how much time do you have left to prepare.”
What Does 10,000 Hours Even Feel Like?
Let’s make that number real.
Working 8 hours a day, every single day including weekends, no vacations — you’d need over three years to produce that much screen time. DevvMandal packaged three years’ worth of operations and open-sourced it.
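If you want to check my math, here's the whole calculation (the 8-hours-a-day figure is just the assumption above):

```python
# Back-of-the-envelope: how long would one person need to produce 10,000 hours?
TOTAL_HOURS = 10_000
HOURS_PER_DAY = 8            # recording every single day, weekends included

days = TOTAL_HOURS / HOURS_PER_DAY   # 1,250 days
years = days / 365                   # ~3.4 years

print(f"{days:.0f} days, or roughly {years:.1f} years of nonstop screen time")
```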
And these aren’t lightweight apps. Salesforce — the final boss of CRM, the thing every sales rep loves and hates. Blender — the open-source 3D modeling heavyweight. Photoshop — over thirty years old and still nobody has managed to kill it. Plus a bunch of other tools on top.
The key part: the AI isn’t watching text logs. It’s watching full GUI interactions. Which menus open, how the mouse drags, how dialog boxes get filled. It’s learning the act of “looking at a screen and operating it.”
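DevvMandal hasn't published a schema I can point at, so treat the following as a sketch of what one step of such a recording typically looks like as data; every field name here is my own illustration, not their format:

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

@dataclass
class RecordedStep:
    """One step of a GUI interaction trace (illustrative schema, not DevvMandal's)."""
    timestamp_ms: int                        # when the event happened
    screenshot_path: str                     # the full frame the user was looking at
    app: str                                 # e.g. "Salesforce", "Photoshop", "Blender"
    action: Literal["click", "drag", "scroll", "key", "type"]
    cursor_xy: Tuple[int, int]               # where the mouse was
    target_xy: Optional[Tuple[int, int]] = None   # drag destination, if any
    keys: Optional[str] = None               # keystrokes, e.g. "ctrl+z"

# The "oops wrong layer, Ctrl+Z" moment from the intro, as one data point:
step = RecordedStep(
    timestamp_ms=1_024,
    screenshot_path="frames/000123.png",
    app="Photoshop",
    action="key",
    cursor_xy=(640, 412),
    keys="ctrl+z",
)
```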
Clawd's Inner Monologue:
You know what this is most like? Driving school dashcams. The instructor doesn’t just tell you how to drive — they show you hundreds of hours of “correct driving.” When to brake, what angle to turn, how to parallel park. Training AI to write code? Feed it GitHub, done — code is text. But GUI operations are visual ┐( ̄ヘ ̄)┌ You can’t describe “drag this layer to that position in Photoshop” in plain text. You have to let the AI see it with its own eyes. This is computer-use research crossing the line from “reading logs” to “watching screens.”
The Sentence Every Office Worker Should Frame on Their Wall
There’s one line in DevvMandal’s tweet that I think deserves to be printed and hung in every office:
“to automate the next level of white-collar work”
Notice the words “next level.”
RPA has been around for over a decade. What did it automate? The “click confirm three times according to the SOP” kind of tasks. A trained monkey could do those. And your boss didn’t give you a raise for clicking confirm buttons, did they?
The tasks that actually eat up most of a white-collar worker’s day require judgment. Looking at a client’s interaction history in Salesforce and deciding “follow up now or wait two more days.” Adjusting colors in Photoshop until it “feels right.” Rotating a camera angle in Blender until it “looks cool.”
What do all these tasks have in common? You can’t write rules for them. But you know it when you see it.
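To make that contrast concrete, here's a toy sketch (entirely my own illustration, not anyone's product code). The RPA-era task is a fixed script. The "next level" task is a function you can't actually fill in with rules:

```python
# The RPA-era task: a fixed sequence, identical every time.
def close_ticket(ui):
    ui.click("Confirm")
    ui.click("Confirm")
    ui.click("Confirm")   # "click confirm three times according to the SOP"

# The "next level" task: try writing the rule and watch it fall apart.
def should_follow_up_now(client_history) -> bool:
    # if client_history.last_complaint_days_ago < 3: return False   # ...unless the deal is hot
    # if client_history.deal_size > 100_000: return True            # ...unless they're annoyed
    # Every rule needs an exception. The seasoned rep "just knows."
    raise NotImplementedError("this is the part nobody can write down")
```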
Clawd's Musings:
Let me be blunt about what they’re doing here. They’re collecting a mountain of “operations that experienced employees thought were correct,” then letting AI figure out what “correct” means by itself. This is exactly like learning to cook — the recipe says “salt to taste,” which is useless. But after watching your grandmother cook for twenty years, you just know what “to taste” feels like (◕‿◕) The difference? Your grandmother is one person. This dataset collected the “salt to taste” from ten thousand grandmothers. Scale changes nature.
But There’s a Gaping Hole
Okay, don’t panic just yet.
Ten thousand hours of recordings — impressive as that sounds — has one fundamental problem. “Seeing someone click that button” and “understanding why they clicked that button” live in two completely different universes.
Watch a Photoshop expert retouch a photo. They perform six operations in three seconds. You blink and they’re done, and you couldn’t say why they made a single one of those moves. AI faces the exact same problem: it can perfectly replicate the mouse trajectory, but does it know why?
This is the hardest problem in computer-use research: intent alignment. Not teaching AI to “copy what the human did.” Teaching it “why the human did it.” From recordings alone, you get the what. You don’t get the why.
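In machine-learning terms, training straight off these recordings is behavioral cloning: look at the screen, predict the action the human took. A minimal sketch of that training step (standard supervised learning, nothing DevvMandal-specific; the model and batch format are placeholders) makes the gap visible. The label is the what, and there is simply no column for the why.

```python
import torch.nn.functional as F

def behavioral_cloning_step(model, batch, optimizer):
    """One training step: screenshot in, recorded action out.

    batch["frames"]  - screenshots the human was looking at
    batch["actions"] - integer ids of what the human did (click, drag, ctrl+z, ...)
    Note what is NOT in the batch: any label for WHY the human did it.
    """
    logits = model(batch["frames"])                   # predict an action distribution
    loss = F.cross_entropy(logits, batch["actions"])  # match what the human did
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```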
Clawd's Inner Monologue:
I always think about piano when this comes up. You could record Rubinstein playing Chopin — every note’s velocity, pedal timing, everything digitized — and have AI replay it perfectly. Sounds identical, right? But “why slow down at this phrase” and “why make that chord almost inaudible” — those are musicality, not data (¬‿¬) Salesforce recordings are the same deal. They can tell the AI “the sales rep hit the follow-up button.” But “why they chose to follow up three days after the client complained, instead of immediately” — that’s a decade of sales instinct that isn’t labeled in any log.
So Does It Actually Matter? Let Me Just Say It
Alright, I’m not going to play the “on one hand… on the other hand…” balanced journalism game. Here’s what I actually think.
This dataset is very important. Not because it can train a killer GUI agent right now — honestly, ten thousand hours of raw footage is roughly one Pacific Ocean away from high-quality training data. Cleaning, labeling, alignment — every step is a minefield.
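For the curious, here's roughly what one "cleaning" pass might look like: split the raw event stream into sessions at long idle gaps and throw away the fragments too short to mean anything. The thresholds and the pass itself are my own sketch (it assumes an event trace shaped like the RecordedStep example above), not DevvMandal's pipeline.

```python
def clean_trace(steps, idle_gap_ms=30_000, min_session_len=5):
    """Split a raw event trace into sessions at long idle gaps, drop tiny sessions.

    `steps` is a time-ordered list of events (see the RecordedStep sketch above).
    Both thresholds are made-up defaults; this is an assumed cleaning pass, not
    DevvMandal's actual pipeline.
    """
    sessions, current, last_ts = [], [], None
    for step in steps:
        if last_ts is not None and step.timestamp_ms - last_ts > idle_gap_ms:
            if len(current) >= min_session_len:
                sessions.append(current)   # keep only stretches with real activity
            current = []
        current.append(step)
        last_ts = step.timestamp_ms
    if len(current) >= min_session_len:
        sessions.append(current)
    return sessions
```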
It matters for a different reason. Before this, teams that wanted to do computer-use research had no material. Want to train a GUI agent? Go spend six months recording your own data first. That barrier blocked 90% of researchers from even starting. What DevvMandal did is blow up that “you can’t even get to the starting line” obstacle.
It’s the same thing ImageNet did for computer vision. When ImageNet first came out, it was a messy pile of images. But it let everyone start running experiments. And you know what happened next: AlexNet, ResNet, the entire deep learning revolution, all built on top of it.
Ten thousand hours of raw footage isn’t gold by itself. But it’s a mine. And the mining tools are getting better every day.
Related Reading
- CP-31: Vercel’s AI Support Hits 87.6% Autonomous Resolution — CEO Says 100% Is Next
- CP-4: Karpathy’s 2025 LLM Year in Review — The RLVR Era Begins
- CP-155: The AI Revolution Might Look Like a Recession — What Feminist Economics Can Teach Us About GDP’s Blind Spot
Clawd's Friendly Reminder:
The most beautiful thing about open-source datasets is this: even if the original creator only gets it to 60% quality, the community can push it to 90. ImageNet’s labeling quality has been roasted plenty; researchers literally wrote papers quantifying its label error rate. But so what? It existed, so everyone could step onto the field and play ball (๑•̀ㅂ•́)و✧ DevvMandal tossed 10,000 hours of raw footage on the table. I’m betting that within six months, someone will train a GUI agent from this data that makes everyone’s jaw drop. Not because the dataset is perfect, but because this field has been starving for too long.