sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only

The paper presents sim2art, a data-driven framework that recovers accurate 3D part segmentation and joint parameters of articulated objects from a single monocular video. By leveraging a robust per-frame surface point representation trained exclusively on synthetic data, it eliminates the need for domain adaptation or real-world annotations while outperforming existing state-of-the-art methods.

Arslan Artykov, Tom Ravaud, Corentin Sautier, Vincent Lepetit

Published 2026-03-24

Imagine you are holding a smartphone and walking around a complex object, like a folding chair, a laptop, or a pair of eyeglasses, filming it as you move. The camera spins, tilts, and zooms. The object itself might be opening, closing, or rotating.

The Problem:
Trying to teach a computer to understand how that object moves and which parts are connected is incredibly hard. It's like trying to figure out the blueprint of a moving machine just by watching a shaky, blurry video of it. Previous methods were like trying to solve a puzzle by tracking every single grain of sand on the object for the entire video. If the camera moved too fast or the object was hidden behind something, the "sand" got lost, and the whole puzzle fell apart. They also often needed expensive, multi-camera setups or perfect 3D scans to work, which isn't practical for everyday use.

The Solution: sim2art
The authors introduce sim2art, a new AI method that acts like a "digital twin" creator. It can take a single, casual video (like one you'd take on your phone) and instantly figure out:

  1. Which parts move (e.g., the laptop screen vs. the keyboard).
  2. How they connect (e.g., the hinge is here, the axis is there).
  3. How they move (e.g., the screen rotates 90 degrees).
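
For a revolute joint like a laptop hinge, points 2 and 3 boil down to recovering a rotation axis and an opening angle. Here is a minimal NumPy sketch of that math (illustrative only, not the paper's actual estimator): given the relative rotation of the screen between two frames, the hinge axis and angle fall out of the standard axis-angle decomposition.

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: rotation matrix for a unit axis and an angle."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def axis_angle_from_rotation(R):
    """Recover the hinge axis and opening angle from a relative rotation."""
    angle = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    # For sin(angle) != 0 the axis can be read off R's skew-symmetric part.
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2 * np.sin(angle))
    return axis, angle

# A screen opening 90 degrees about a hinge along the y axis:
R = rotation_about_axis(np.array([0.0, 1.0, 0.0]), np.pi / 2)
axis, angle = axis_angle_from_rotation(R)  # → axis ≈ (0, 1, 0), angle ≈ π/2
```
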

Here is how it works, using some simple analogies:

1. The "Snapshot" Strategy (No Long-Term Tracking)

Imagine you are trying to understand how a dancer moves.

  • Old Way: You try to follow every single freckle on the dancer's skin from the start of the song to the end. If the dancer spins too fast or a curtain blocks the view, you lose the freckle, and you get confused.
  • sim2art Way: Instead of following freckles, you just take a quick photo of the dancer's pose right now. You look at the shape of the body in that single frame. Then, you look at the next frame and do the same. By comparing these "snapshots" of the surface, the AI understands the movement without needing to keep a perfect record of every single point over time. It's robust because it doesn't care if a point disappears for a second; it just looks at the next available point.
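
The "compare two snapshots" idea can be made concrete with the classic Kabsch algorithm: given matched surface points in two frames, it finds the best-fit rigid motion between them in closed form, with no tracking history required. This is a generic sketch of that building block, not sim2art's actual network.

```python
import numpy as np

def kabsch(P, Q):
    """Best-fit rotation R and translation t so that R @ P + t ≈ Q.

    P, Q: (3, N) arrays of corresponding surface points in two frames.
    """
    p_mean = P.mean(axis=1, keepdims=True)
    q_mean = Q.mean(axis=1, keepdims=True)
    H = (P - p_mean) @ (Q - q_mean).T
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Recover a known rigid motion from noise-free correspondences:
rng = np.random.default_rng(1)
P = rng.standard_normal((3, 40))
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R_true *= np.sign(np.linalg.det(R_true))  # make it a proper rotation
Q = R_true @ P + np.array([[0.1], [0.2], [0.3]])
R_est, t_est = kabsch(P, Q)
```

Because each frame pair is solved independently, losing a point for a few frames costs nothing: the next pair of snapshots simply uses whatever points are visible then.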

2. The "Video Game Training" (Synthetic Data Only)

Usually, to teach a robot to understand the real world, you need to show it thousands of real-world examples (like showing a child a million real chairs). This is slow and expensive.

  • sim2art's Trick: The team built a "video game" (a simulation) where they generated thousands of fake videos of moving objects. They trained the AI entirely inside this game.
  • The Magic: Because the AI learned to look at the surface of the object rather than complex, long-term tracking, it didn't notice the difference between the fake game world and the real world. It's like a pilot training in a flight simulator; the physics are so accurate that when they get in a real plane, they know exactly what to do without needing extra practice. This means sim2art works on real videos immediately, without needing to be retrained on real data.
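
A synthetic training pipeline of this kind generates endless labeled examples for free. The sketch below is a deliberately toy version (the object geometry, ranges, and labels are all made up for illustration): each sample is a randomized two-part "laptop" given as a labeled point cloud with a random opening angle.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_sample(n_points=256):
    """One randomized training example: a two-part 'laptop' as labeled points.

    The base is a flat slab of points; the lid is another slab rotated
    about a hinge along the x axis by a random opening angle.
    Returns points (N, 3), per-point part labels (N,), and the angle.
    """
    half = n_points // 2
    base = rng.uniform([-1, -1, 0], [1, 1, 0.05], size=(half, 3))
    lid = rng.uniform([-1, 0, 0], [1, 2, 0.05], size=(half, 3))
    angle = rng.uniform(0, np.pi / 2)  # random articulation state
    R = np.array([[1, 0, 0],
                  [0, np.cos(angle), -np.sin(angle)],
                  [0, np.sin(angle), np.cos(angle)]])  # hinge along x
    lid = lid @ R.T
    points = np.concatenate([base, lid])
    labels = np.concatenate([np.zeros(half, int), np.ones(half, int)])
    return points, labels, angle

points, labels, angle = synthetic_sample()
```

In the simulator, ground-truth part labels and joint parameters come for free with every sample, which is exactly what makes supervised training possible without any real-world annotation.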

3. The "Super-Brain" (Transformer Architecture)

The AI uses a special type of neural network called a Transformer (the same technology behind advanced chatbots).

  • Think of the video as a conversation. The AI looks at all the points on the object at once and asks, "Hey, this point on the laptop screen is moving differently than this point on the keyboard. They must be connected by a hinge!"
  • It also uses "semantic features" (like recognizing that a specific texture looks like a screen) and "scene flow" (a quick sense of how things are moving between frames) to make its guesses even smarter.
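
The intuition that "points moving together belong to the same part" can be sketched as clustering per-point scene flow. This tiny 2-means implementation is a stand-in for illustration, assuming two rigid parts; the paper's Transformer learns a far richer grouping.

```python
import numpy as np

def segment_by_flow(flow, n_iters=10):
    """Split points into two rigid groups by clustering their flow vectors.

    flow: (N, 3) per-point scene flow between consecutive frames.
    Points whose motion vectors agree are grouped into the same part
    (e.g., keyboard points barely move; screen points sweep similar arcs).
    """
    # Initialize centers with the smallest- and largest-motion points.
    mags = np.linalg.norm(flow, axis=1)
    centers = flow[[mags.argmin(), mags.argmax()]].astype(float)
    for _ in range(n_iters):
        d = np.linalg.norm(flow[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = flow[labels == k].mean(axis=0)
    return labels

# Static keyboard points vs. uniformly moving screen points:
flow = np.concatenate([np.zeros((50, 3)), np.ones((50, 3))])
labels = segment_by_flow(flow)  # first 50 in one group, last 50 in the other
```
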

Why This Matters

  • Robustness: It works even when the camera is shaky, the object is partially hidden, or the lighting is bad.
  • Versatility: It can handle objects with many moving parts (like a filing cabinet with 5 drawers), not just simple two-part objects.
  • Future Applications: Once the AI understands the "skeleton" and "joints" of an object, it can create a perfect 3D digital twin. This is huge for:
    • Robotics: A robot can look at a real chair, understand how the legs fold, and pick it up without breaking it.
    • Digital Twins: You could film your messy desk, and the computer could build a perfect, interactive 3D model of it for a video game or VR.
    • Augmented Reality: You could point your phone at a real cabinet, and the app could show you exactly how to open the drawers or where the hinges are.

In a Nutshell:
sim2art is like giving a computer a pair of "X-ray glasses" that can look at a shaky video of a moving object and instantly draw the blueprint of its moving parts, all by learning from a video game instead of needing a million real-world examples. It turns a messy, casual video into a precise, interactive 3D model.
