Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement

Imagine you have a toy robot, a kitchen drawer, or a folding chair. These are what scientists call "articulated objects." They aren't just one solid block; they are made of different rigid pieces (like a door, a handle, or a drawer) connected by joints (like hinges or sliders) that allow them to move.

For a long time, computers have struggled to understand how these objects move just by looking at them. They usually needed a cheat sheet: "Tell us exactly how many parts there are, and show us the object in two specific poses (start and finish)." If the object opened up to reveal a hidden interior (like a fridge door opening to show the shelves inside), the computer would get confused and break the model.

This paper introduces a new method called AIM (Articulation in Motion). Think of AIM as a detective that doesn't need a cheat sheet. It just watches a video of you playing with the object and figures out the rules of the game all by itself.

Here is how it works, broken down with some everyday analogies:

1. The Problem: The "Before and After" Photo Trap

Old methods were like trying to solve a puzzle by only looking at the first and last photo.

The Flaw: If you open a fridge, the inside shelves were invisible in the "closed" photo but visible in the "open" photo. The computer, trying to match the two photos, gets lost. It doesn't know if the shelves are a new part or if the door is moving.
The Result: The computer often guesses the wrong number of parts or breaks the object into weird, messy pieces.

2. The Solution: Watching the Movie (AIM)

Instead of comparing two static photos, AIM watches a video of the object moving. This lets the computer follow the whole movement, not just its start and end points.

3. The Secret Sauce: The "Dual-Gaussian" Dance

The paper uses a technology called 3D Gaussian Splatting. Imagine the object is made of millions of tiny, glowing, fuzzy balls (Gaussians) floating in space.

The Old Way: In previous methods, every ball was told to move a little bit to match the video. This created a lot of "noise." Static parts (like the fridge body) would wiggle slightly, confusing the computer about what is actually moving.
The AIM Way (Dual-Gaussian): AIM splits the balls into two teams:
1. The Static Team: These balls stay perfectly still. They form the "base" of the object.
2. The Dynamic Team: These balls are the "actors." They are the only ones allowed to move.
The Analogy: Imagine a stage play. The set (walls, floor) is painted on a static backdrop. The actors (the moving parts) walk around.
- Old methods let everything move, backdrop included, so the walls seemed to stretch and wiggle along with the actors.
- AIM keeps the backdrop frozen and only lets the actors move. This makes it incredibly easy to see who is actually moving and who is just standing there.

4. The "Magic Filter": Finding the Hidden Parts

Sometimes, when you open a drawer, you see the inside of the cabinet for the first time. In the video, these new parts look like they are "moving" because they are appearing.

The SDMD Module: AIM has a special filter (called Static-During-Motion Detection). It watches the paths of the "actors." If a group of balls on the "Dynamic Team" turns out to barely move at all (like the inside of the cabinet that comes into view as the drawer opens), the filter says, "Ah, you aren't an actor; you're part of the set!" It moves those balls from the "Dynamic Team" to the "Static Team."
Why it matters: This prevents the computer from thinking the inside of the fridge is a separate moving part.

5. The Detective Work: RANSAC

Once AIM has separated the "actors" from the "set," it needs to figure out how the actors are connected.

The Problem: How do you know if a group of balls is a door swinging on a hinge, or a drawer sliding on a track?
The Solution: AIM uses a mathematical technique called Sequential RANSAC.
- The Analogy: Imagine you are at a party and you see groups of people dancing. You don't know who is dancing with whom. You pick a few people, guess a dance move (e.g., "they are all spinning around a pole"), and see if everyone else in that group fits that move. If they do, you've found a dance group! If not, you try a different guess.
- AIM does this automatically. It looks at the paths the "actors" took and groups them into rigid parts. It figures out: "These balls are all swinging around this specific hinge," or "These balls are all sliding in this specific direction."
- Best of all: It doesn't need to know how many groups there are beforehand. It just finds them naturally.

Why is this a Big Deal?

No Cheat Sheets: You don't need to tell the computer, "This object has 3 parts." It figures it out.
Handles the Unknown: It works even when the object opens up to reveal hidden insides (like a fridge or a safe).
Real-World Ready: It was tested not only on synthetic objects but also on real-world video recorded with wearable camera glasses (Meta Project Aria).

In summary:
Previous methods tried to solve a 3D puzzle by comparing two snapshots and often got stuck when the picture changed too much. AIM watches a video, cleanly separates the "moving actors" from the "static stage," and then uses a detective-style algorithm to figure out how the actors are connected. It turns a messy, confusing video into a clean, interactive 3D model that captures how the object works.

1. Problem Statement

Articulated objects (e.g., doors, drawers, scissors) are ubiquitous in daily life and critical for robotics, mixed reality, and embodied AI. While recent advances in neural 3D representations (like 3D Gaussian Splatting) enable high-fidelity object reconstruction, analyzing part-level mobility and articulation remains challenging.

Limitations of Existing Methods:

Two-State Dependency: Most state-of-the-art methods (e.g., DTA, ArtGS) rely on comparing two distinct states (Start and End) to establish geometric correspondences.
Prior Knowledge Requirement: These methods typically require the number of parts to be known in advance.
Correspondence Failure: They struggle when the "End" state reveals regions absent in the "Start" state (e.g., the interior of a refrigerator or oven). This breaks cross-state correspondence, leading to unstable optimization, incorrect segmentation, and failure to recover the correct number of parts.
Static-Dynamic Confusion: Existing dynamic 3DGS methods often assign deformation fields to all Gaussians, including static ones, introducing noise that confuses part-level structure analysis.

2. Methodology: Articulation in Motion (AIM)

The authors propose AIM, a framework that reconstructs geometry, segments parts, and estimates kinematics using a single interaction video and an initial static 3D scan. It operates without prior knowledge of the number of parts or joint types.

The pipeline consists of three stages:

Stage I: Initial Static Modeling

A standard 3D Gaussian Splatting (3DGS) model is trained on a multi-view RGB scan of the object in its initial (closed/static) state. This creates the initial Gaussian set $\{GS\}$.
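For readers new to 3DGS, the standard formulation is recalled below as background. It comes from the general 3DGS literature rather than from this summary, and the symbols $\mu_i$, $\Sigma_i$, $R_i$, $S_i$, $\alpha_i$, and $c_i$ are conventional notation, not names used by AIM. Each primitive is an anisotropic Gaussian whose covariance is factored into a rotation and a scale, and a pixel color is obtained by alpha-blending the depth-sorted Gaussians that overlap it:

$$
G_i(x) = \exp\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\right), \qquad \Sigma_i = R_i S_i S_i^\top R_i^\top
$$

$$
C = \sum_{i=1}^{N} c_i\,\alpha_i \prod_{j=1}^{i-1} (1-\alpha_j)
$$

Here $\alpha_i$ is the learned opacity of Gaussian $i$ multiplied by its projected 2D footprint at the pixel. The opacity term is what the pruning step in Stage II relies on.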

Stage II: Dual-Gaussian Scene Representation & Dynamic-Static Disentanglement

To handle continuous motion and newly revealed static regions, AIM introduces a Dual-Gaussian representation:

Static Base ($\{GS_p\}$): Represents the static parts of the object.
Moving Gaussians ($\{GM, t\}$): A deformable 3DGS set that tracks moving parts over time using a learned deformation field (MLP-based).
Joint Optimization:
- The two sets are jointly optimized against the interaction video.
- Pruning: Moving elements are progressively pruned from the static set $\{GS\}$ as their opacity decreases, refining the static base $\{GS_p\}$.
- Static-During-Motion Detection (SDMD): A critical module that identifies regions that are static during motion (e.g., the interior of a fridge revealed as the door opens). It analyzes trajectories of the moving Gaussians; if a group exhibits negligible motion (below a threshold), it is reclassified as static and reassigned to $\{GS_p\}$. This prevents "static leakage" into the moving set.
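The SDMD test described above can be illustrated with a minimal sketch. It assumes the trajectories of the moving Gaussians are available as an array of centers over time, and it applies the threshold per Gaussian rather than per group. The function name `detect_static_during_motion`, the maximum-displacement criterion, and the threshold value are all hypothetical choices for illustration; the summary only states that motion below a threshold triggers reclassification.

```python
import numpy as np

def detect_static_during_motion(trajectories, threshold=0.01):
    """Flag Gaussians whose centers barely move over the whole sequence.

    trajectories: (N, T, 3) array of Gaussian centers at T time steps.
    Returns a boolean mask of shape (N,), True where the Gaussian is
    considered static. The criterion (maximum displacement from the first
    frame) and the threshold are illustrative assumptions.
    """
    # Distance of every Gaussian from its own position at the first time step.
    displacement = np.linalg.norm(trajectories - trajectories[:, :1, :], axis=-1)
    return displacement.max(axis=1) < threshold

# Toy data: three Gaussians that never move and three that drift steadily.
rng = np.random.default_rng(0)
T = 5
static_pts = rng.normal(size=(3, 1, 3)).repeat(T, axis=1)
moving_pts = static_pts + np.linspace(0.0, 1.0, T)[None, :, None]
trajectories = np.concatenate([static_pts, moving_pts], axis=0)

static_mask = detect_static_during_motion(trajectories)
# Gaussians with static_mask == True would be reassigned to the static base.
```

In a full pipeline the flagged Gaussians would be removed from the moving set and appended to the static base before trajectories are handed to Stage III.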

Stage III: Prior-Free Part Mobility Analysis

Once the dynamic and static components are disentangled, AIM analyzes the trajectories of the moving Gaussians to determine part structure:

Sequential RANSAC: Instead of clustering based on pre-defined part counts, AIM uses a robust, optimization-free Sequential Random Sample Consensus (RANSAC) algorithm with a Kabsch solver.
Process:
1. It samples minimal sets of moving Gaussians across time windows (e.g., $t=0 \to 0.5$ and $t=0 \to 1$).
2. It estimates rigid transformations (rotation/translation) for these sets.
3. It identifies inliers (Gaussians fitting the rigid motion model) and groups them into a single rigid part.
4. The process repeats iteratively on remaining Gaussians until all moving parts are clustered.
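The four steps above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it uses a single time window (start and end positions only), a minimal sample of three points, and hypothetical parameters `inlier_tol`, `min_inliers`, and `n_iters` whose values are assumptions. The `kabsch` function is the standard SVD-based least-squares rigid fit.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) with Q approximately P @ R.T + t."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

def sequential_ransac(start, end, inlier_tol=0.02, min_inliers=10,
                      n_iters=200, seed=0):
    """Peel off rigid parts one at a time; the part count is not given."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(start))
    parts = []
    while len(remaining) >= min_inliers:
        best = np.array([], dtype=int)
        for _ in range(n_iters):
            # Steps 1-2: sample a minimal set and fit a rigid transform.
            sample = rng.choice(remaining, size=3, replace=False)
            R, t = kabsch(start[sample], end[sample])
            # Step 3: collect the points that follow this rigid motion.
            pred = start[remaining] @ R.T + t
            residual = np.linalg.norm(pred - end[remaining], axis=1)
            inliers = remaining[residual < inlier_tol]
            if len(inliers) > len(best):
                best = inliers
        if len(best) < min_inliers:
            break                                 # only outliers are left
        R, t = kabsch(start[best], end[best])     # refit on all inliers
        parts.append((best, R, t))
        # Step 4: repeat on whatever has not been explained yet.
        remaining = np.setdiff1d(remaining, best)
    return parts

# Toy scene: a "door" rotating 30 degrees about z and a "drawer" sliding in y.
rng = np.random.default_rng(1)
door = rng.uniform(-1, 1, size=(40, 3))
drawer = rng.uniform(-1, 1, size=(30, 3)) + np.array([5.0, 0.0, 0.0])
a = np.deg2rad(30.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
start = np.vstack([door, drawer])
end = np.vstack([door @ Rz.T, drawer + np.array([0.0, 0.5, 0.0])])

parts = sequential_ransac(start, end)
print([len(idx) for idx, _, _ in parts])
```

Because each round removes the inliers it found, the loop stops on its own once too few points remain, which is how the number of parts falls out of the data rather than being supplied as input. Extending the inlier test to several time windows, as the text describes, would mean requiring a point to fit the rigid model in every window.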
Kinematic Estimation: For each identified rigid part, the system calculates articulation parameters:
- Joint Type: Revolute (hinge) or Prismatic (sliding), determined by rotation angle thresholds.
- Parameters: Joint axis position, direction, rotation angle ($\Theta$), and translation distance ($\Phi$).
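One plausible way to turn a fitted rigid transform into these joint parameters is sketched below. The 5-degree threshold, the function name `classify_joint`, and the dictionary layout are assumptions made for illustration; the summary states only that the joint type is decided by rotation angle thresholds. The rotation angle (the text's $\Theta$) comes from the trace of the rotation matrix, the axis direction from its skew-symmetric part, and a point on the axis from solving $(I - R)\,p = t$; for a prismatic joint the translation norm plays the role of $\Phi$.

```python
import numpy as np

def classify_joint(R, t, angle_threshold_deg=5.0):
    """Interpret the rigid motion x -> R @ x + t as a revolute or prismatic joint.

    The threshold value is an illustrative assumption.
    """
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))

    if angle_deg < angle_threshold_deg:
        # Negligible rotation: treat as sliding along the translation vector.
        distance = np.linalg.norm(t)
        direction = t / distance if distance > 0 else np.zeros(3)
        return {"type": "prismatic", "direction": direction, "distance": distance}

    # Axis direction from the skew-symmetric part of R
    # (ill-conditioned when the angle approaches 180 degrees).
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    axis = axis / np.linalg.norm(axis)
    # Points on the axis are fixed by the motion: (I - R) p = t.
    # lstsq returns the axis point closest to the origin.
    pivot, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)
    return {"type": "revolute", "axis": axis, "pivot": pivot, "angle_deg": angle_deg}

# A hinge: 30 degrees about a vertical axis passing through (1, 2, 0).
a = np.deg2rad(30.0)
R_hinge = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
c = np.array([1.0, 2.0, 0.0])
hinge = classify_joint(R_hinge, c - R_hinge @ c)

# A slider: no rotation, 0.5 units along y.
slider = classify_joint(np.eye(3), np.array([0.0, 0.5, 0.0]))
```

A real motion usually mixes a small rotation with a translation because of fitting noise, which is why a threshold rather than an exact zero test is needed.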

3. Key Contributions

Prior-Free Framework: AIM is presented as the first method to achieve robust part segmentation and articulation estimation without requiring the number of parts or joint types as input.
Dual-Gaussian Representation: A novel scene representation that explicitly disentangles static and dynamic components, allowing for clean trajectory extraction and handling of newly revealed static regions via the SDMD module.
Sequential RANSAC for Mobility: A novel application of Sequential RANSAC to cluster moving primitives into rigid parts based purely on motion cues, eliminating the need for structural priors or unstable cross-state optimization.
Video-Based Input: Shifts the paradigm from two-state comparisons to continuous interaction videos, better aligning with human interaction and avoiding correspondence failures in "open-end" scenarios.

4. Experimental Results

The method was evaluated on the PartNet-Mobility dataset (including simple 2-part, 3-part, and complex multi-part objects) and real-world data captured via Meta Project Aria glasses.

Part Segmentation: AIM significantly outperforms state-of-the-art two-state methods (DTA, ArtGS) in 3D Intersection-over-Union (IoU).
- On complex objects (e.g., Storage with 6 moving parts), AIM improved mean dynamic-part IoU by 27.11% over the previous SOTA.
- It successfully handles "closed-start to open-end" scenarios where other methods fail due to missing geometric correspondences.
Reconstruction Quality: While using only RGB inputs, AIM achieves competitive Chamfer Distance (CD) on static parts and superior accuracy on dynamic parts compared to methods relying on depth or two-state inputs.
Articulation Estimation:
- Axis/Angle Error: Reduced significantly (e.g., from $12.78^\circ$ to $0.58^\circ$ on complex storage objects).
- Motion Error: Near-zero error for prismatic joints ($0.02$ units).
- Robustness: Unlike baselines that often fail (denoted as "F" or "WT" for wrong type) on complex objects, AIM consistently predicts correct joint types and axes.
Real-World Validation: Demonstrated robust performance on real-world videos with occlusions and varying lighting, correctly identifying joint axes and motion ranges (e.g., predicting an oven door opening of $\approx 82^\circ$ vs. ground truth $85^\circ$).
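For reference, the three kinds of metric named above (IoU, Chamfer Distance, and axis angle error) are commonly computed along the following lines. This is a generic sketch: the exact definitions, sampling densities, and units used in the paper's evaluation are not given in this summary, and the function names are hypothetical.

```python
import numpy as np

def part_iou(pred_mask, gt_mask):
    """Intersection-over-Union between two boolean part masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0

def chamfer_distance(A, B):
    """Symmetric Chamfer Distance between point sets A (N, 3) and B (M, 3).

    Brute-force pairwise distances; fine for small sets only.
    """
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle between two joint axes in degrees, ignoring axis orientation."""
    cos = abs(np.dot(pred_axis, gt_axis))
    cos /= np.linalg.norm(pred_axis) * np.linalg.norm(gt_axis)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

pred = np.array([True, True, False, False])
gt = np.array([True, False, False, False])
iou = part_iou(pred, gt)                       # 1 shared element, 2 in the union

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cd_same = chamfer_distance(pts, pts)           # identical sets

flipped = axis_angle_error_deg(np.array([0.0, 0.0, 1.0]),
                               np.array([0.0, 0.0, -1.0]))
```

The absolute value in `axis_angle_error_deg` treats an axis and its negation as the same line, which is the usual convention when the sign of a joint axis is arbitrary.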

5. Significance

Paradigm Shift: Moves away from rigid, two-state geometric correspondence towards motion-driven analysis, which is more robust to occlusions and partial visibility.
Practicality: Removes the need for manual annotation of part counts or joint types, making it applicable to real-world, unknown articulated objects.
Foundation for Interaction: By producing high-quality, part-level 3D replicas with accurate kinematics, AIM enables more realistic simulation, robotics manipulation, and interactive digital twins for complex articulated objects.

In summary, AIM addresses a key bottleneck in articulation analysis of unknown objects by leveraging continuous motion cues and a novel dual-Gaussian disentanglement strategy, achieving state-of-the-art performance without structural priors.