MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

Imagine you are trying to teach a computer to understand human movement. You have two very different sources of information:

The Video Camera: It sees the whole picture, like a movie. It knows what is happening (a person is dancing), but it can get confused if the person is blocked by someone else, if the lighting is bad, or if the camera angle changes. It's like watching a play from the back of a theater; you see the actors, but you can't hear their breathing or feel their heartbeat.
The IMU Sensors: These are tiny motion trackers (like the ones in smartwatches or fitness bands) strapped to a person's limbs. They are very precise about how the body is moving, measuring acceleration and rotation many times a second. But they are "blind." They don't know if the person is dancing in a living room or running in a park. They just feel the movement.

The Problem:
Right now, these two technologies don't talk to each other well. If you try to match a video clip with a sensor reading, the computer often gets lost. It might think two different dances are the same because they look similar from a distance, or it might fail to sync them up perfectly because the video lags slightly behind the sensor data.

The Solution: MoBind (Motion Binding)
The authors of this paper created a new system called MoBind. Think of MoBind as a super-smart translator and matchmaker that forces the video and the sensors to understand each other on a very deep level.

Here is how it works, using some everyday analogies:

1. Ignoring the Background Noise (The "Focus" Filter)

Imagine you are trying to match a recording of a drummer's hands to a video of a band. If you look at the whole video, the computer gets distracted by the guitarist, the singer, and the crowd.

MoBind's Trick: Instead of looking at the whole video (the raw pixels), MoBind looks only at the skeleton (the stick-figure outline of the person). It ignores the background, the clothes, and the scenery. It focuses purely on the "dance moves" themselves, just like the sensors do.

2. The "Body Part" Matchmaking (The Local Connection)

Imagine you have a group of dancers, and you want to know which dancer is wearing a sensor on their left wrist.

Old Way: The computer looks at the whole person and guesses. It's like trying to find a specific person in a crowd by looking at their whole outfit.
MoBind's Trick: MoBind breaks the body down into parts. It says, "Okay, let's look only at the left wrist in the video and match it only to the sensor on the left wrist." It does this for the right leg, the head, the torso, etc.
The Analogy: It's like a puzzle solver that doesn't try to match the whole picture at once. Instead, it matches the "left arm piece" of the video to the "left arm piece" of the sensor data. This makes the connection much stronger and more accurate.

3. The "Micro-Sync" (The Sub-Second Clock)

Sometimes, the video and the sensor are off by a tiny fraction of a second (like a drummer and a singer who are slightly out of sync).

The Challenge: If you just look at the whole song, you might not notice the tiny lag. But if you want to know exactly when the drummer hit the snare, you need to look at the split second.
MoBind's Trick: MoBind uses a hierarchical approach.
- Level 1 (The Microscope): It aligns tiny chunks of time (fractions of a second) between the sensor and the specific body part.
- Level 2 (The Telescope): It then steps back and looks at the whole body moving together to make sure the overall "vibe" matches.
- The Result: It can sync the video and sensor data with sub-second precision. It's like a conductor who can hear a drummer is 0.1 seconds late and instantly corrects the orchestra.

4. The "Fill-in-the-Blanks" Game (The Memory Helper)

There's a risk that if you focus too much on tiny details (like the exact instant a foot hits the ground), the computer might forget the big picture (e.g., "This is a dance, not a fight").

MoBind's Trick: They added a game called Masked Token Prediction. Imagine reading a sentence where some words are hidden, and you have to guess them based on the context.
The Analogy: MoBind hides some of the sensor data and forces itself to guess what was there. This forces the computer to remember the meaning of the action (e.g., "This is a jump") while still learning the precise timing. It keeps the "big picture" in mind while doing the "micro-details."

Why Does This Matter? (The Real-World Magic)

Because MoBind is so good at connecting these two worlds, it can do four amazing things:

Automatic Syncing: You can record a video and wear sensors without needing to press a button at the exact same time. MoBind figures out the timing automatically, like a DJ matching two different songs.
Privacy-Friendly Search: You can search for a video using a sensor recording (e.g., "Show me all the times I did a squat") without needing to look at the video first. This is great for privacy because you don't need to store or share the video to find the data.
Finding the Right Person: In a room with five people, MoBind can tell you exactly which person is wearing the sensor on their left ankle. It's like a detective who can identify a suspect just by the way they walk.
Better Action Recognition: It helps computers understand human movement better for things like sports analysis, rehabilitation (checking if a patient is doing exercises correctly), and gaming.

In Summary:
MoBind is like a universal translator that teaches a blind sensor and a distracted camera to speak the same language. By focusing on the skeleton, matching body parts individually, and syncing time to within a fraction of a second, it builds a strong bridge between how we move and how we are seen.

1. Problem Statement

The paper addresses the challenge of learning a joint representation between Inertial Measurement Unit (IMU) signals and 2D pose sequences extracted from video. While existing methods can align modalities at a coarse, action-category level (e.g., distinguishing "walking" from "running"), they fail to achieve fine-grained, sub-second temporal alignment.

Current limitations in the field include:

Lack of Temporal Precision: Most contrastive learning approaches compress entire clips into a single global embedding. This collapses fine-grained temporal structures (e.g., phase shifts, short lags, or repetition boundaries), making sub-second synchronization impossible.
Irrelevant Visual Noise: Aligning raw video pixels with IMU data introduces noise from background elements that are irrelevant to the specific motion captured by the sensor.
Multi-Sensor Complexity: IMUs are often deployed in multi-sensor configurations (attached to different body parts). Naively concatenating these signals fails to capture the spatial and temporal specificity of each sensor's location.
Ambiguity in Repetitive Motion: Human motion often involves highly repetitive patterns (e.g., walking cycles), creating ambiguous synchronization cues that standard audio-visual synchronization techniques struggle to resolve.
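
The first limitation can be made concrete with a toy example. The sketch below uses plain Python and made-up two-dimensional tokens to show that pooling over time maps a sequence and a time-shifted copy of it to the same clip embedding, so an objective defined only on pooled embeddings cannot tell the two apart. Mean pooling, the circular shift, and all names here are assumptions chosen for illustration; the baselines discussed in the paper may pool differently.

```python
import math

def mean_pool(tokens):
    """Collapse a sequence of token vectors into one clip-level vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]

def circular_shift(tokens, lag):
    """Delay a sequence by `lag` steps, wrapping around at the end."""
    lag %= len(tokens)
    return tokens[-lag:] + tokens[:-lag]

# Toy periodic motion: one period of a sine/cosine pair sampled at 16 steps.
T = 16
sequence = [[math.sin(2 * math.pi * t / T), math.cos(2 * math.pi * t / T)]
            for t in range(T)]
shifted = circular_shift(sequence, lag=3)

pooled_a = mean_pool(sequence)
pooled_b = mean_pool(shifted)

# Largest difference between the two clip-level embeddings (float rounding only).
pooled_gap = max(abs(a - b) for a, b in zip(pooled_a, pooled_b))
# Largest difference when the sequences are compared token by token.
token_gap = max(abs(a - b)
                for tok_a, tok_b in zip(sequence, shifted)
                for a, b in zip(tok_a, tok_b))
```

Here `pooled_gap` is zero up to rounding while `token_gap` is large, which is the motivation for keeping a sequence of temporal tokens instead of a single global vector.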

2. Methodology: MoBind Framework

The authors propose MoBind, a hierarchical contrastive learning framework designed to align IMU signals with skeletal motion sequences rather than raw pixels. The architecture consists of three core components:

A. Modality-Specific Encoders

IMU Module: Processes $N$ IMU streams. Each stream passes through 1D convolutional blocks followed by a Transformer layer to generate a sequence of temporal tokens.
Pose Module: Extracts 2D skeletal joint coordinates from video. Crucially, the full-body skeleton is decomposed into local body-part segments corresponding to the IMU sensor locations. These segments are encoded using the same architecture as the IMU module.
Token Alignment: Both modalities are processed to produce sequences of $T$ temporal tokens, ensuring temporal resolution is preserved.
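
The body-part decomposition in the Pose Module can be pictured as slicing each frame's joints into groups, one group per sensor location. The sketch below is only an illustration: the joint names, the four sensor locations, and the choice of which joints belong to each part are hypothetical, since each dataset defines its own skeleton and sensor placement.

```python
# Hypothetical joint layout; real datasets define their own skeletons.
JOINT_NAMES = ["head", "neck", "l_shoulder", "l_elbow", "l_wrist",
               "r_shoulder", "r_elbow", "r_wrist", "pelvis",
               "l_hip", "l_knee", "l_ankle", "r_hip", "r_knee", "r_ankle"]

# Assumed mapping: each IMU location -> joints that form its local body part.
PART_JOINTS = {
    "left_wrist": ["l_shoulder", "l_elbow", "l_wrist"],
    "right_wrist": ["r_shoulder", "r_elbow", "r_wrist"],
    "left_ankle": ["l_hip", "l_knee", "l_ankle"],
    "right_ankle": ["r_hip", "r_knee", "r_ankle"],
}

def decompose_pose(pose_sequence):
    """Split a (frames x joints x 2) pose sequence into per-part sequences.

    Returns a dict mapping part name -> list of frames, where each frame is
    a flat list of (x, y) coordinates for that part's joints.
    """
    index = {name: i for i, name in enumerate(JOINT_NAMES)}
    parts = {}
    for part, joints in PART_JOINTS.items():
        ids = [index[j] for j in joints]
        parts[part] = [[c for i in ids for c in frame[i]]
                       for frame in pose_sequence]
    return parts

# Two frames of dummy 2D joints: joint k sits at (k, k + 0.5) in frame 0
# and is shifted by 1 in both coordinates in frame 1.
poses = [[(k + f, k + 0.5 + f) for k in range(len(JOINT_NAMES))]
         for f in range(2)]
segments = decompose_pose(poses)
```

Each entry of `segments` would then be fed to its own part encoder, mirroring the way each IMU stream is encoded separately.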

B. Hierarchical Contrastive Alignment

MoBind employs a three-level contrastive objective to enforce alignment at different granularities:

Token-Level: Aligns individual temporal tokens ($Z_t$) across modalities to capture sub-second synchrony.
Local-Level: Aligns the representation of a specific IMU sensor with the motion of its corresponding body part ($Z_n^{imu}$ vs. $Z_n^{part}$). This handles the multi-sensor configuration explicitly.
Global-Level: Aggregates local representations (via concatenation and an MLP aggregator) to align full-body IMU representations ($G^{imu}$) with full-body pose representations ($G^{part}$).

Each level is trained with an InfoNCE loss, and the overall objective combines the three terms using weighting coefficients ($\lambda_g, \lambda_l, \lambda_t$).
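
As a rough illustration of how such an objective might be assembled, the sketch below implements a plain InfoNCE loss in standard-library Python and combines three instances of it with weights. Several details are assumptions rather than facts from the paper: the symmetric (two-direction) form, cosine similarity with a temperature of 0.1, the choice of negatives (all other items in the list), and the plain weighted sum. The names `info_nce`, `symmetric_info_nce`, and `hierarchical_loss` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(queries, keys, temperature=0.1):
    """One-directional InfoNCE: queries[i] should match keys[i] against all other keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

def symmetric_info_nce(a, b, temperature=0.1):
    """Average of the a->b and b->a directions (symmetry is an assumption here)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def hierarchical_loss(token_pair, local_pair, global_pair,
                      lam_t=1.0, lam_l=1.0, lam_g=1.0, temperature=0.1):
    """Weighted sum of the three InfoNCE terms (the summation form is assumed)."""
    loss_t = symmetric_info_nce(*token_pair, temperature)
    loss_l = symmetric_info_nce(*local_pair, temperature)
    loss_g = symmetric_info_nce(*global_pair, temperature)
    return lam_g * loss_g + lam_l * loss_l + lam_t * loss_t

# Toy embeddings: index i in one modality should match index i in the other.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_matched = hierarchical_loss((aligned, aligned), (aligned, aligned),
                                 (aligned, aligned))
loss_mismatched = hierarchical_loss((aligned, shuffled), (aligned, shuffled),
                                    (aligned, shuffled))
```

With these toy inputs, correctly paired embeddings give a loss near zero and mispaired ones a much larger loss, which is the behavior each level of the objective relies on.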

C. Masked Token Prediction (MTP)

To prevent the model from over-focusing on fine-grained alignment at the expense of high-level semantic understanding (crucial for action recognition), the authors introduce an auxiliary Masked Token Prediction (MTP) task.

Mechanism: Random tokens in the IMU sequence are masked and replaced with a learnable query vector. A lightweight Transformer decoder predicts the missing tokens using the surrounding context.
Goal: This acts as a regularizer, forcing the embeddings to retain coarse-grained action semantics while still learning fine-grained dynamics.
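
The masking step and the reconstruction loss can be sketched in a few lines. This is a simplified stand-in under stated assumptions: the query vector is a fixed list here (it is learnable in the described model), the predictor averages the nearest visible neighbors instead of running a Transformer decoder, and mean squared error is assumed as the reconstruction loss. All function names are hypothetical.

```python
import random

def mask_tokens(tokens, mask_ratio, query, rng):
    """Replace a random subset of tokens with the shared query vector."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    hidden = set(masked_idx)
    masked = [list(query) if i in hidden else list(tok)
              for i, tok in enumerate(tokens)]
    return masked, masked_idx

def predict_from_context(masked, masked_idx):
    """Stand-in predictor: average the nearest visible token on each side.

    Assumes at least one token is left unmasked.
    """
    hidden = set(masked_idx)
    visible = [i for i in range(len(masked)) if i not in hidden]
    preds = {}
    for i in masked_idx:
        left = max((j for j in visible if j < i), default=None)
        right = min((j for j in visible if j > i), default=None)
        neighbors = [masked[j] for j in (left, right) if j is not None]
        preds[i] = [sum(vals) / len(neighbors) for vals in zip(*neighbors)]
    return preds

def mtp_loss(tokens, preds):
    """Mean squared error, computed on the masked positions only."""
    errors = [(p - t) ** 2
              for i, pred in preds.items()
              for p, t in zip(pred, tokens[i])]
    return sum(errors) / len(errors)

# Toy IMU token sequence: a smooth ramp, so context is informative.
tokens = [[t * 0.1, 1.0 - t * 0.1] for t in range(12)]
query = [0.0, 0.0]      # fixed placeholder for the learnable query vector
rng = random.Random(0)  # seeded for repeatability

masked, masked_idx = mask_tokens(tokens, mask_ratio=0.25, query=query, rng=rng)
preds = predict_from_context(masked, masked_idx)
loss = mtp_loss(tokens, preds)
```

Because the loss is computed only at masked positions, the encoder is pushed to produce tokens from which neighboring content can be inferred, which is the regularizing effect the Goal line describes.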

3. Key Contributions

Fine-Grained Temporal Alignment: MoBind is presented as the first framework to achieve sub-second synchronization between IMUs and video, which it does by modeling token-level dynamics rather than only global clip-level features.
Structured Multi-Sensor Modeling: By decomposing the skeleton into local body parts and aligning them with specific IMU sensors, the framework naturally handles multi-sensor setups and enables body-part localization (identifying where on the body a sensor is worn).
Motion-Centric Representation: The shift from raw pixels to skeletal motion sequences effectively filters out irrelevant visual background, improving robustness in complex scenes.
Unified Framework: The model simultaneously supports four downstream tasks: cross-modal retrieval, temporal synchronization, subject/body-part localization, and human action recognition.
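
Once IMU and body-part embeddings live in a shared space, body-part localization can be framed as a nearest-neighbor lookup. The sketch below assumes cosine similarity and uses made-up three-dimensional embeddings; the function name `localize` and the candidate names are hypothetical, and the same pattern would apply to picking one person out of several.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def localize(imu_embedding, candidates):
    """Return the candidate whose pose embedding is most similar to the IMU embedding."""
    scores = {name: cosine(imu_embedding, emb) for name, emb in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up embeddings for one IMU stream and three candidate body parts.
imu_embedding = [0.9, 0.1, 0.0]
part_embeddings = {
    "left_wrist": [1.0, 0.0, 0.1],
    "right_wrist": [0.0, 1.0, 0.0],
    "left_ankle": [0.1, 0.2, 1.0],
}

best_part, scores = localize(imu_embedding, part_embeddings)
```

In this toy case `best_part` is `"left_wrist"`, since that candidate points in nearly the same direction as the IMU embedding.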

4. Experimental Results

The method was evaluated on three datasets: mRi (rehabilitation), TotalCapture (dynamic motion), and EgoHumans (multi-person scenes).

Cross-Modal Retrieval: MoBind significantly outperforms baselines (IMU2CLIP, DeSPITE, SyncNet) in both IMU→Video and Video→IMU directions. On mRi, it achieves 94% Recall@1 (vs. 77% for SyncNet), demonstrating superior instance-level alignment.
Temporal Synchronization:
- Achieves a Mean Absolute Error (MAE) of 0.47 s on mRi and 0.05 s on TotalCapture.
- On EgoHumans, synchronization error remains below 50 ms for all action categories.
- It successfully aligns sequences with random offsets up to ±7 seconds.
Subject & Body-Part Localization:
- Achieves 98.12% accuracy in identifying the correct person in multi-person scenes (EgoHumans).
- Successfully identifies the specific body part wearing the sensor (e.g., left wrist vs. right ankle) with high accuracy.
Human Action Recognition (HAR):
- MoBind achieves state-of-the-art performance in HAR (98% accuracy on mRi), which suggests that the MTP task preserves action-level semantics.
Robustness: The model remains effective even when up to 50% of IMU sensors are dropped, demonstrating resilience for real-world deployment where sensors may fail.
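
To show what the synchronization numbers measure, here is one plausible way to estimate an offset from two token sequences and score it with MAE. It is a sketch under assumptions, not the paper's procedure: the search slides one sequence over the other and keeps the lag with the highest mean cosine similarity, and the token rate of 10 tokens per second is invented for the example. The toy signal is a chirp (its frequency increases over time) so that the best lag is unambiguous, unlike strictly repetitive motion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def estimate_lag(imu_tokens, pose_tokens, max_lag):
    """Return the lag (in tokens) at which pose_tokens[t + lag] best matches imu_tokens[t]."""
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(imu_tokens[t], pose_tokens[t + lag])
                 for t in range(len(imu_tokens))
                 if 0 <= t + lag < len(pose_tokens)]
        if not pairs:
            continue
        score = sum(cosine(a, b) for a, b in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true offsets."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Toy chirp tokens; the IMU stream starts 4 tokens later than the pose stream.
base = [[math.sin(0.02 * t * t), math.cos(0.02 * t * t)] for t in range(64)]
pose_tokens = base[:60]
imu_tokens = base[4:]

TOKEN_RATE_HZ = 10.0  # assumed token rate, for converting lags to seconds
lag = estimate_lag(imu_tokens, pose_tokens, max_lag=10)
offset_seconds = lag / TOKEN_RATE_HZ
```

Collecting `offset_seconds` over many clips and passing them to `mean_absolute_error` together with the true offsets would yield a figure comparable in kind to the MAE values reported above.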

5. Significance and Impact

MoBind represents a significant advancement in multimodal human motion analysis. Its ability to perform calibration-free temporal synchronization removes the need for cumbersome hardware triggers or manual alignment, making multimodal data collection more accessible.

Furthermore, by solving the spatial association problem (linking a sensor to a specific person and body part), MoBind enables robust tracking in multi-person environments and under occlusion. The framework's success in balancing fine-grained temporal dynamics with coarse semantic consistency suggests a new paradigm for cross-modal learning that could be applied to other sensor-vision pairs beyond IMUs.

Code Availability: The code is open-sourced at https://github.com/bbvisual/MoBind.