Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling

This paper addresses the scarcity of labeled surgical robot data with Cosmos-H-Surgical, a world model that generates realistic surgical videos and infers synthetic kinematics for them via an inverse dynamics model. Policies trained on this combined data outperform those trained solely on limited real-world demonstrations.

Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, Yueming Jin, Ren Zhao, Baiyong Shen, Daguang Xu

Published Thu, 12 Ma

Imagine you want to teach a robot how to perform delicate surgery, like stitching a wound or passing a needle. The biggest problem isn't that the robot is "stupid"; it's that we don't have enough training data.

In the real world, getting a robot to watch a surgeon and copy their hand movements is incredibly hard. It's expensive, dangerous for patients, and requires special permission. It's like trying to learn to drive a Formula 1 car by only being allowed to sit in the driver's seat for 10 minutes a year.

Cosmos-H-Surgical is a new system that solves this by acting like a super-powered "Imagination Machine" for robots. Here is how it works, broken down into simple steps:

1. The Problem: The "Silent Library"

Think of the internet as a massive library filled with millions of videos of surgeons operating. These are the "Silent Library" books. They are full of visual action (you can see the needle moving), but they say nothing about what the robot needs most: the exact math of how the surgeon's hands moved (the "kinematics"). Without that math, a robot can't learn to copy the move.

2. The Solution: The "Surgical Storybook" (SATA Dataset)

The researchers created a special book called SATA. They took those silent videos and added detailed captions written by experts.

  • Instead of just a video of a needle, they wrote: "The left tool gently grabs the needle, moves it to the right, and pokes the tissue at a 45-degree angle."
  • This turns a silent movie into a storybook that explains exactly what is happening.
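
To make the idea concrete, here is a hypothetical sketch of what one caption-annotated record in such a dataset might look like. The field names and file path are illustrative, not the actual SATA schema:

```python
# Hypothetical SATA-style record: a silent clip paired with an
# expert-written caption describing the tool motion.
# (Field names and path are made up for illustration.)
sata_record = {
    "video": "clips/needle_pass_0042.mp4",
    "caption": (
        "The left tool gently grabs the needle, moves it to the right, "
        "and pokes the tissue at a 45-degree angle."
    ),
    "task": "needle_passing",
}

print(sata_record["caption"])
```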

3. The Magic Engine: The "Dreamer" (World Model)

Using this storybook, they trained a World Model. Think of this model as a Hollywood Director who has read every surgical storybook ever written.

  • If you tell this Director, "Show me a robot passing a needle three times," it doesn't just guess; it generates a brand new, photorealistic video of that exact scene.
  • Because it learned from the storybook, the video looks real, the tools move correctly, and the tissue reacts naturally.
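
From the outside, the "Director" is just a function from a text prompt to a stack of video frames. The stub below sketches only that interface shape, with random pixels standing in for the real model's output (a real world model would denoise latents conditioned on the prompt; nothing here is the paper's actual API):

```python
import numpy as np

def generate_video(prompt: str, num_frames: int = 16,
                   height: int = 64, width: int = 64,
                   seed: int = 0) -> np.ndarray:
    """Stand-in for a text-conditioned world model.

    Returns a (T, H, W, 3) uint8 frame stack. Here the frames are
    random noise; a real model would synthesize a photorealistic
    surgical scene matching the prompt.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(num_frames, height, width, 3),
                        dtype=np.uint8)

frames = generate_video("a robot passing a needle three times")
print(frames.shape)  # (16, 64, 64, 3)
```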

4. The Translator: The "Mind Reader" (Inverse Dynamics Model)

Here is the clever part. The Director (World Model) makes the video, but the robot still doesn't know how to move its arms to make that video happen.

  • The researchers added a second AI called an Inverse Dynamics Model (IDM). Think of this as a Mind Reader.
  • The Mind Reader watches the fake video generated by the Director and says, "Ah, for the needle to move that way in the video, the robot's left arm must have moved in this specific mathematical pattern."
  • It essentially reverse-engineers the robot's movements from the video.
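
The "Mind Reader" idea can be demonstrated end to end in a toy linear world: the tool's 2-D position is observed only through a random "camera" projection (a stand-in for pixels), and an inverse dynamics model is fit by least squares to map consecutive observations back to the action that caused the change. This is a minimal sketch of the concept, not the paper's IDM architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 2-D tool position, observed through a random linear
# "camera" that maps state to a 16-D feature vector (stand-in for pixels).
M = rng.normal(size=(16, 2))

# Roll out a random trajectory: the action is the change in tool position.
states = np.cumsum(rng.normal(size=(500, 2)), axis=0)
actions = states[1:] - states[:-1]   # ground-truth "kinematics"
obs = states @ M.T                   # what the "video" actually shows

# Inverse dynamics model: least-squares map from (o_t, o_{t+1}) to a_t.
X = np.hstack([obs[:-1], obs[1:]])   # consecutive observation pairs
W, *_ = np.linalg.lstsq(X, actions, rcond=None)

pred = X @ W
err = np.abs(pred - actions).max()
print(f"max action recovery error: {err:.2e}")
```

Because the toy observations are a full-rank linear function of state, the fitted model recovers the actions almost exactly; a real IDM faces pixels and deformable tissue, so it needs a deep network, but the input/output contract is the same.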

5. The Result: The "Super-Student" Robot

Now, the robot has a massive new library of training data:

  1. Real Data: A tiny bit of real footage from actual surgeries (the "gold standard" but scarce).
  2. Synthetic Data: Thousands of hours of "fake" but perfect videos generated by the Director, with the exact math of how to move, provided by the Mind Reader.

The robot trains on this massive mix. It's like a student who reads a few real textbooks but then spends years practicing in a virtual reality simulator that is so realistic, it feels like the real thing.
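
The mixing step above can be sketched as a co-training batch sampler that draws from both pools. The 25% real fraction and dataset sizes below are illustrative choices, not the paper's actual training recipe:

```python
import random

random.seed(0)

real_data = [(f"real_clip_{i:02d}", "real") for i in range(5)]            # scarce
synthetic_data = [(f"gen_clip_{i:04d}", "synthetic") for i in range(1000)]  # abundant

def sample_batch(batch_size=8, real_fraction=0.25):
    """Co-training batch: mix scarce real demonstrations with abundant
    generated (video, pseudo-kinematics) pairs.

    real_fraction is a hypothetical knob; the right mix in practice is
    an empirical question.
    """
    n_real = int(batch_size * real_fraction)
    batch = random.choices(real_data, k=n_real)
    batch += random.choices(synthetic_data, k=batch_size - n_real)
    random.shuffle(batch)
    return batch

batch = sample_batch()
print(sum(1 for _, src in batch if src == "real"))  # → 2
```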

Why This Matters

  • Safety: We don't need to risk patient safety to generate training data.
  • Speed: New practice scenarios (e.g., "What if the tissue is slippery?" or "What if the needle is bent?") can be generated on demand, without staging anything in an operating room.
  • Performance: When they tested this on a real robot, the robot trained with this "imagination" data performed much better than robots trained only on the tiny amount of real data available.

In short: Cosmos-H-Surgical teaches robots to learn surgery by letting them watch a movie of the surgery, then using a smart AI to figure out the script (the math) behind the movie, allowing them to practice millions of times without ever touching a real patient.