ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

ReMoRa is a multimodal large language model that enables efficient long-video understanding by processing sparse RGB keyframes alongside refined, denoised motion representations, thereby overcoming computational bottlenecks and outperforming existing methods on long-video benchmarks.

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura

Published 2026-02-24

Imagine you are trying to understand a 2-hour movie, but you only have a tiny amount of time and energy to read the script.

Most current AI models try to solve this by reading every single word of the script, one by one. If the movie is 2 hours long, that's a massive script. The AI gets overwhelmed, slows down, and starts missing the plot because it's too busy staring at the same background scenery for too long.

ReMoRa is a new, smarter AI that changes the game. Instead of reading every single word, it reads a condensed, high-speed summary that captures the action without the fluff.

Here is how it works, using simple analogies:

1. The Problem: The "Redundant Script"

Think of a video like a flipbook. If you flip through 1,000 pages where a character is just standing in a room, 990 of those pages look almost exactly the same.

  • Old AI: Reads all 1,000 pages. It wastes energy on the 990 boring pages and might miss the one page where the character actually throws a ball.
  • The Bottleneck: Reading that many pages takes too much computer power (memory and time), especially for long videos.
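To see why this is a real bottleneck and not just an inconvenience, here is a back-of-the-envelope count. The frame rate, tokens-per-frame, and keyframe interval below are illustrative assumptions (typical of ViT-style video encoders), not figures from the paper:

```python
# Back-of-the-envelope token count: dense vs. sparse video processing.
# All constants are illustrative assumptions, not figures from the paper.

FPS = 30                  # typical video frame rate
HOURS = 2
TOKENS_PER_FRAME = 196    # e.g., a 14x14 patch grid from a ViT encoder

total_frames = FPS * 3600 * HOURS
dense_tokens = total_frames * TOKENS_PER_FRAME

# Sparse alternative: one keyframe every 2 seconds plus cheap motion features.
keyframes = total_frames // (FPS * 2)
sparse_tokens = keyframes * TOKENS_PER_FRAME

print(dense_tokens)   # 42336000 -- far beyond any LLM context window
print(sparse_tokens)  # 705600   -- a 60x reduction before motion features
```

Even before accounting for attention's quadratic cost, tens of millions of tokens are simply not feedable to a language model, which is why skipping the redundant frames matters so much.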

2. The Solution: The "Director's Cut" (Compressed Video)

ReMoRa doesn't read the whole flipbook. It looks at the video file the way video compression software (like the codec behind your Netflix stream) sees it.

  • I-Frames (The Keyframes): These are the "full pictures." ReMoRa grabs a few high-quality snapshots of the scene (like a photo of the room).
  • Motion Vectors (The Arrows): Instead of storing the next 50 full pictures, the codec just records arrows showing how blocks of pixels moved from the last picture to the next. "The cat moved 2 inches left," "The cup tilted right."
  • ReMoRa's Superpower: It reads the photos for what things look like, and the arrows for what things did. It skips the boring parts entirely.
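The keyframe-plus-arrows idea can be sketched in a few lines. This is a toy illustration of how codecs like H.264 store video, not ReMoRa's actual pipeline; the block size and function names are made up for the example:

```python
# Toy sketch of compressed-domain video: a keyframe (I-frame) plus per-block
# motion vectors saying where each block of the next frame came from.
# Illustrative only -- real codecs add residuals, sub-pixel motion, etc.
import numpy as np

BLOCK = 4  # one motion vector per 4x4 block in this toy example

def predict_next_frame(keyframe: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Reconstruct the next frame by shifting each block along its motion vector."""
    h, w = keyframe.shape
    pred = np.zeros_like(keyframe)
    for by in range(0, h, BLOCK):
        for bx in range(0, w, BLOCK):
            dy, dx = motion[by // BLOCK, bx // BLOCK]
            # Source block in the keyframe (clamped to stay inside the image).
            sy = int(np.clip(by + dy, 0, h - BLOCK))
            sx = int(np.clip(bx + dx, 0, w - BLOCK))
            pred[by:by + BLOCK, bx:bx + BLOCK] = keyframe[sy:sy + BLOCK, sx:sx + BLOCK]
    return pred

frame = np.arange(64, dtype=float).reshape(8, 8)      # one 8x8 "photo"
vectors = np.zeros((2, 2, 2), dtype=int)              # (blocks_y, blocks_x, [dy, dx])
vectors[0, 0] = (0, 4)                                # top-left block came from 4px right
moved = predict_next_frame(frame, vectors)
```

The point: the "next frame" costs only a tiny grid of arrows instead of a whole new picture, and ReMoRa reads those arrows directly instead of decoding every frame.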

3. The "Noise" Problem: The "Fuzzy Arrows"

Here's the catch: Those "arrows" (motion vectors) generated by video compression are often messy. They are blocky, blurry, and sometimes point in the wrong direction because the computer tried to save space.

  • Analogy: Imagine someone giving you directions by pointing with a shaky hand and saying, "Go that way... maybe?" It's hard to follow.

4. The Magic Fix: The "Refiner" (RMR Module)

ReMoRa has a special "Refiner" module. Think of this as a smart editor or a restoration artist.

  • When ReMoRa gets those shaky, blocky arrows, the Refiner cleans them up.
  • It turns the "shaky, fuzzy directions" into a smooth, high-definition flow. It fills in the gaps so the AI can see exactly how a person's hand moved or how a ball bounced, even though it never saw the actual video frames in between.

5. The "Long Memory" Trick: The "State Space" (HMSS Module)

Even with the cleaned-up arrows, a 2-hour movie is still a long list of instructions. Standard AI gets confused if the list is too long (it forgets the beginning by the time it reaches the end).

  • The Old Way: Trying to remember every single step of a long journey at once.
  • ReMoRa's Way: It uses a State Space Model (think of it as a "smart summary notebook"). Instead of remembering every single step, it updates a running summary.
    • Step 1: "We are in the kitchen." -> Notebook updates.
    • Step 2: "We walk to the living room." -> Notebook updates.
    • It keeps the story flowing in a straight line without getting bogged down by the sheer number of pages. This allows it to handle hours of video without running out of memory.
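The "smart summary notebook" above is exactly what a linear state-space recurrence does: one fixed-size state vector, updated once per step, so memory stays constant no matter how long the video runs. The matrices below are random placeholders, not the paper's actual HMSS parameters:

```python
# Minimal sketch of the state-space idea: a fixed-size state updated once per
# time step, so memory cost is constant regardless of sequence length.
# A and B here are illustrative, not the paper's learned HMSS parameters.
import numpy as np

rng = np.random.default_rng(1)
STATE, FEAT = 16, 8
A = np.eye(STATE) * 0.95                 # decay: old information slowly fades
B = rng.normal(0, 0.1, (STATE, FEAT))    # how each new input enters the state

def summarize(frames: np.ndarray) -> np.ndarray:
    """Fold an arbitrarily long sequence into one fixed-size state vector."""
    h = np.zeros(STATE)                  # the "notebook" starts blank
    for x in frames:                     # one cheap update per step: h = A h + B x
        h = A @ h + B @ x
    return h

# 10,000 "frames" of motion features, but the summary is still just 16 numbers.
features = rng.normal(size=(10_000, FEAT))
state = summarize(features)
print(state.shape)  # (16,)
```

Contrast this with attention, where every step must look back at every previous step: the recurrence does constant work per frame, which is what lets the model scale from minutes to hours.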

Why Does This Matter?

Because ReMoRa is so efficient, it can:

  • Watch longer videos (hours instead of minutes).
  • Spot tiny details (like a subtle hand gesture or a quick glance) that other AIs miss because they were too busy processing the background.
  • Work faster and cheaper because it doesn't need to "decode" every single frame of the video.

In a nutshell:
If watching a video is like reading a book, other AIs try to read every letter. ReMoRa reads the chapter titles and the highlighted action sentences, then uses a smart editor to fill in the missing details, allowing it to understand the whole story quickly and accurately.
