Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling

This paper addresses the scarcity of labeled surgical robot data with Cosmos-H-Surgical, a world model that generates realistic surgical videos and infers synthetic kinematics for them via an inverse dynamics model. Policies trained on this combined data outperform those trained solely on limited real-world demonstrations.

Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, Yueming Jin, Ren Zhao, Baiyong Shen, Daguang Xu

Published Thu, 12 Ma

Imagine you want to teach a robot how to perform delicate surgery, like stitching a wound or passing a needle. The biggest problem isn't that the robot is "stupid"; it's that we don't have enough training data.

In the real world, getting a robot to watch a surgeon and copy their hand movements is incredibly hard. It's expensive, dangerous for patients, and requires special permission. It's like trying to learn to drive a Formula 1 car by only being allowed to sit in the driver's seat for 10 minutes a year.

Cosmos-H-Surgical is a new system that solves this by acting like a super-powered "Imagination Machine" for robots. Here is how it works, broken down into simple steps:

1. The Problem: The "Silent Library"

Think of the internet as a massive library filled with millions of videos of surgeons operating. These are the "Silent Library" books. They are full of visual action (you can see the needle moving), but they say nothing about what the robot needs most: the exact math of how the surgeon's hands moved (the "kinematics"). Without that math, a robot can't learn to copy the move.

2. The Solution: The "Surgical Storybook" (SATA Dataset)

The researchers created a special book called SATA. They took those silent videos and added detailed captions written by experts.

  • Instead of just a video of a needle, they wrote: "The left tool gently grabs the needle, moves it to the right, and pokes the tissue at a 45-degree angle."
  • This turns a silent movie into a storybook that explains exactly what is happening.
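
To make the idea concrete, here is a hypothetical sketch of what one caption-annotated record in such a dataset might look like. The field names and file path are illustrative, not the actual SATA schema:

```python
# Hypothetical SATA-style record: a silent clip paired with an
# expert-written caption describing the tool motion.
# (Field names and path are made up for illustration.)
sata_record = {
    "video": "clips/needle_pass_0042.mp4",
    "caption": (
        "The left tool gently grabs the needle, moves it to the right, "
        "and pokes the tissue at a 45-degree angle."
    ),
    "task": "needle_passing",
}

print(sata_record["caption"])
```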

3. The Magic Engine: The "Dreamer" (World Model)

Using this storybook, they trained a World Model. Think of this model as a Hollywood Director who has read every surgical storybook ever written.

  • If you tell this Director, "Show me a robot passing a needle three times," it doesn't just guess; it generates a brand new, photorealistic video of that exact scene.
  • Because it learned from the storybook, the video looks real, the tools move correctly, and the tissue reacts naturally.
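
From the outside, the "Director" is just a function from a text prompt to a stack of video frames. The stub below sketches only that interface shape, with random pixels standing in for the real model's output (a real world model would denoise latents conditioned on the prompt; nothing here is the paper's actual API):

```python
import numpy as np

def generate_video(prompt: str, num_frames: int = 16,
                   height: int = 64, width: int = 64,
                   seed: int = 0) -> np.ndarray:
    """Stand-in for a text-conditioned world model.

    Returns a (T, H, W, 3) uint8 frame stack. Here the frames are
    random noise; a real model would synthesize a photorealistic
    surgical scene matching the prompt.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(num_frames, height, width, 3),
                        dtype=np.uint8)

frames = generate_video("a robot passing a needle three times")
print(frames.shape)  # (16, 64, 64, 3)
```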

4. The Translator: The "Mind Reader" (Inverse Dynamics Model)

Here is the clever part. The Director (World Model) makes the video, but the robot still doesn't know how to move its arms to make that video happen.

  • The researchers added a second AI called an Inverse Dynamics Model (IDM). Think of this as a Mind Reader.
  • The Mind Reader watches the fake video generated by the Director and says, "Ah, for the needle to move that way in the video, the robot's left arm must have moved in this specific mathematical pattern."
  • It essentially reverse-engineers the robot's movements from the video.
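
The "Mind Reader" idea can be demonstrated end to end in a toy linear world: the tool's 2-D position is observed only through a random "camera" projection (a stand-in for pixels), and an inverse dynamics model is fit by least squares to map consecutive observations back to the action that caused the change. This is a minimal sketch of the concept, not the paper's IDM architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 2-D tool position, observed through a random linear
# "camera" that maps state to a 16-D feature vector (stand-in for pixels).
M = rng.normal(size=(16, 2))

# Roll out a random trajectory: the action is the change in tool position.
states = np.cumsum(rng.normal(size=(500, 2)), axis=0)
actions = states[1:] - states[:-1]   # ground-truth "kinematics"
obs = states @ M.T                   # what the "video" actually shows

# Inverse dynamics model: least-squares map from (o_t, o_{t+1}) to a_t.
X = np.hstack([obs[:-1], obs[1:]])   # consecutive observation pairs
W, *_ = np.linalg.lstsq(X, actions, rcond=None)

pred = X @ W
err = np.abs(pred - actions).max()
print(f"max action recovery error: {err:.2e}")
```

Because the toy observations are a full-rank linear function of state, the fitted model recovers the actions almost exactly; a real IDM faces pixels and deformable tissue, so it needs a deep network, but the input/output contract is the same.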

5. The Result: The "Super-Student" Robot

Now, the robot has a massive new library of training data:

  1. Real Data: A tiny bit of real footage from actual surgeries (the "gold standard" but scarce).
  2. Synthetic Data: Thousands of hours of "fake" but perfect videos generated by the Director, with the exact math of how to move, provided by the Mind Reader.

The robot trains on this massive mix. It's like a student who reads a few real textbooks but then spends years practicing in a virtual reality simulator that is so realistic, it feels like the real thing.
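
The mixing step above can be sketched as a co-training batch sampler that draws from both pools. The 25% real fraction and dataset sizes below are illustrative choices, not the paper's actual training recipe:

```python
import random

random.seed(0)

real_data = [(f"real_clip_{i:02d}", "real") for i in range(5)]            # scarce
synthetic_data = [(f"gen_clip_{i:04d}", "synthetic") for i in range(1000)]  # abundant

def sample_batch(batch_size=8, real_fraction=0.25):
    """Co-training batch: mix scarce real demonstrations with abundant
    generated (video, pseudo-kinematics) pairs.

    real_fraction is a hypothetical knob; the right mix in practice is
    an empirical question.
    """
    n_real = int(batch_size * real_fraction)
    batch = random.choices(real_data, k=n_real)
    batch += random.choices(synthetic_data, k=batch_size - n_real)
    random.shuffle(batch)
    return batch

batch = sample_batch()
print(sum(1 for _, src in batch if src == "real"))  # → 2
```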

Why This Matters

  • Safety: We don't need to risk patient safety to generate training data.
  • Speed: New practice scenarios (e.g., "What if the tissue is slippery?" or "What if the needle is bent?") can be generated on demand, without staging anything in an operating room.
  • Performance: When they tested this on a real robot, the robot trained with this "imagination" data performed much better than robots trained only on the tiny amount of real data available.

In short: Cosmos-H-Surgical teaches robots to learn surgery by letting them watch a movie of the surgery, then using a smart AI to figure out the script (the math) behind the movie, allowing them to practice millions of times without ever touching a real patient.