Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Imagine you are teaching a robot to do chores, like putting a cup on a table or stacking blocks. In the past, robots were like blindfolded chefs: they could hear your instructions ("put the cup here") and see the kitchen, but they had to guess how to move their arms to get the job done. They often stumbled because they didn't understand the consequences of their movements.

Enter Mantis, a new type of robot brain that changes the game. Think of Mantis not just as a robot, but as a robot with a crystal ball.

Here is the simple breakdown of how it works, using some everyday analogies:

1. The Problem: The "Overworked Brain"

Previous robot models tried to do two huge jobs at once with one brain:

Understand the world (e.g., "That's a red cup, and I need to pick it up").
Predict the future (e.g., "If I move my arm this way, the cup will end up there").

Trying to do both simultaneously is like asking a student to write a history essay and solve a complex math problem at the exact same time. The brain gets overwhelmed, the math comes out wrong, and the essay falls flat. The robot either forgets how to reason or moves clumsily.

2. The Solution: The "Disentangled" Crystal Ball

Mantis introduces a clever trick called Disentangled Visual Foresight. Imagine Mantis has a specialized assistant (the "Crystal Ball") who lives in a separate room.

The Main Brain (The Chef): Focuses entirely on understanding your voice and the scene. It knows what you want and what things are.
The Assistant (The Crystal Ball): Its only job is to look at the current scene and say, "If we move the arm like this, the cup will look like that in one second."

Mantis doesn't ask the main brain to do the heavy lifting of predicting every pixel of the future. Instead, it asks the Assistant to simulate the future. The Assistant then whispers a secret code (called "latent actions") to the Main Brain: "Hey, to get the cup there, you need to move your arm slightly up and right."

This separation allows the Main Brain to stay sharp at understanding language and reasoning, while the Assistant handles the physics of movement.

3. The Training: Learning from Humans and Robots

Mantis was trained in three distinct phases, like a student progressing through school:

Phase 1: The Human Observer. Mantis watched 220,000 videos of humans doing things (like opening jars or stacking blocks). It didn't know how to do it yet, but it learned how objects move and interact. It learned the "physics" of the world.
Phase 2: The Robot Apprentice. Mantis watched 76,000 videos of actual robots doing tasks. Now it connected the "physics" it learned from humans to the specific movements of robot arms.
Phase 3: The Language Tutor. Finally, Mantis studied 38 different datasets of images and text (like a massive library of picture books). This ensured that when you say, "Put the cup on the Iron Man statue," Mantis actually knows who Iron Man is and doesn't just guess.

4. The "Smart Pause" (Adaptive Temporal Ensemble)

One of Mantis's coolest features is how it moves.

The Old Way: Some robots are like a nervous driver who checks the rearview mirror every 0.1 seconds, even when driving on a straight, empty highway. This wastes energy and makes the ride jerky.
Mantis's Way (ATE): Mantis is like a smart cruise control.
- If it's just moving an empty arm across the room, it moves fast and checks less often (saving energy).
- If it's trying to place a cup on a tiny, wobbly coaster, it instantly switches to "high-precision mode," checking its position constantly to ensure it doesn't spill.

This "Adaptive Temporal Ensemble" (ATE) lets Mantis get by with roughly 50% fewer decision-making passes (inference calls) without losing accuracy.

The Results: Why It Matters

When tested, Mantis came out ahead of strong competitors both in simulation and on a real robot.

In Simulations: It achieved a 96.7% success rate on complex tasks, beating previous top models.
In the Real World: When asked to do things it had never seen before (like "Put the cup on the female singer"), Mantis understood the concept and found Taylor Swift. A competing robot (π0.5) got confused and failed because it lacked the language reasoning Mantis developed.

The Bottom Line

Mantis is like giving a robot a separate "future-simulating" brain so its main brain can focus on being smart, understanding you, and reasoning through problems. It learns from watching humans, practices with robots, and reads books to understand the world. The result? A robot that doesn't just follow orders blindly but actually understands what it's doing and can adapt to new, tricky situations.

1. Problem Statement

Vision-Language-Action (VLA) models aim to translate linguistic instructions and visual observations into executable robotic actions. However, existing approaches face three fundamental challenges:

The Supervision Mismatch: High-dimensional visual inputs often overwhelm the model, while the action signals (low-dimensional) are too sparse to effectively supervise the large VLA backbone. This leads to underutilized representational capacity.
The Visual Foresight Trade-off: Integrating visual foresight (predicting future frames) can help, but directly predicting high-dimensional future states diverts model capacity away from action learning and incurs prohibitive training costs. Conversely, compressing visual states into compact signals creates information bottlenecks, losing fine-grained motion details.
Reasoning Degradation: Many existing VLA models neglect language supervision during training, causing the model to overwrite its pre-trained semantic understanding and reasoning capabilities, leading to poor instruction following and generalization.

2. Methodology: Mantis Framework

The authors propose Mantis, a novel VLA framework centered around Disentangled Visual Foresight (DVF). The core philosophy is to decouple visual foresight prediction from the main action learning backbone to preserve reasoning capabilities while still leveraging visual dynamics.

A. Model Architecture

Mantis consists of four main components:

VLM Backbone ( $P$ ): Uses Qwen2.5-VL as the foundation, chosen for its robust reasoning and flexible resolution handling.
DVF Head ( $D$ ): A Diffusion Transformer (DiT) based on Sana. It predicts future visual states ( $o_{t+n}$ ) given the current state ( $o_t$ ) and language instruction ( $l$ ).
Connector ( $C$ ): Bridges the backbone output to the DiT input space.
Action Head ( $\pi$ ): A DiT-based head that predicts action trajectories.

Key Innovation: Disentangled Visual Foresight

Meta-Queries & Latent-Action Queries: Instead of forcing the backbone to generate future frames, Mantis introduces learnable latent-action queries [LAT]. These queries, combined with a residual connection of the current visual state into the DiT, automatically capture inter-frame dynamics (latent actions) necessary to transition from $o_t$ to $o_{t+n}$ .
Decoupling: Future-frame prediction is handled by the DVF head rather than by the backbone or the action head. The [LAT] queries extract the "visual trajectory" dynamics, which are then fed into the action head via causal attention. This allows the backbone to focus on semantic understanding (via language supervision) while the DVF head handles the visual prediction.
Multi-Gap Queries: To handle diverse temporal dynamics, the model uses [GAP] queries to predict future frames at varying time intervals (1 to 6 steps).
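
One way to picture the causal-attention hand-off described above is a lower-triangular mask over a token sequence in which the [LAT] queries come before the action queries. The sketch below is illustrative only: the token ordering, the function name `build_causal_mask`, the action-query slot, and all sizes are assumptions rather than details given in the text, and the [GAP] queries are left out for brevity.

```python
import numpy as np

def build_causal_mask(n_ctx: int, n_lat: int, n_act: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed (hypothetical) layout: [context tokens | [LAT] queries | action queries].
    """
    total = n_ctx + n_lat + n_act
    # Lower-triangular: each token attends to itself and to earlier tokens only.
    return np.tril(np.ones((total, total), dtype=bool))

# Toy sizes chosen for illustration only.
n_ctx, n_lat, n_act = 4, 2, 2
mask = build_causal_mask(n_ctx, n_lat, n_act)

lat = slice(n_ctx, n_ctx + n_lat)
act = slice(n_ctx + n_lat, n_ctx + n_lat + n_act)

# Action queries can read every latent-action query...
actions_see_latents = bool(mask[act, lat].all())
# ...but latent-action queries cannot read the action queries.
latents_see_actions = bool(mask[lat, act].any())
```

Under this layout, information flows one way, from the latent-action queries to the action queries, which is the property the causal mask is meant to guarantee.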

B. Progressive Training Recipe

To avoid cross-modal competition and ensure stable convergence, Mantis employs a three-stage progressive training strategy:

Stage 1 (Multiple Gap Vision Training): The model is trained on human manipulation videos (SSV2) to predict future frames. The backbone is frozen; only the DVF head and queries are optimized. This teaches general manipulation skills and world knowledge.
Stage 2 (Vision-Action Joint Training): Robot demonstration data (DROID) is introduced. The model optimizes a combined loss ( $\alpha L_{DVF} + L_{action}$ ) to align visual foresight with explicit actions. The backbone remains frozen.
Stage 3 (Language Supervised Mix Training): The backbone is unfrozen and trained on a mix of 38 multimodal datasets (e.g., LLaVA-Instruct, COCO) alongside robot data. This preserves the model's reasoning and instruction-following capabilities.
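
The three-stage recipe can be summarized as a small schedule of which modules receive gradients and which loss terms are active. This is a minimal sketch under stated assumptions: the module names, the `Stage` class, the default `alpha` and `beta` values, the choice to train the action head from Stage 2 onward, and the form of the Stage 3 loss (adding a language term weighted by `beta`) are all hypothetical. The text only specifies the Stage 2 loss $\alpha L_{DVF} + L_{action}$ and which stages freeze the backbone.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset   # hypothetical module names that receive gradients
    use_action: bool       # action loss active?
    use_language: bool     # language loss active? (assumed for Stage 3)

STAGES = [
    # Stage 1: backbone frozen; only the DVF head and queries are optimized.
    Stage("multi-gap vision", frozenset({"dvf_head", "queries"}), False, False),
    # Stage 2: backbone still frozen; action head assumed trainable.
    Stage("vision-action joint",
          frozenset({"dvf_head", "queries", "action_head"}), True, False),
    # Stage 3: backbone unfrozen and trained with language supervision.
    Stage("language-supervised mix",
          frozenset({"backbone", "dvf_head", "queries", "action_head"}), True, True),
]

def total_loss(stage, l_dvf, l_action=0.0, l_lang=0.0, alpha=0.1, beta=1.0):
    """Combine per-batch loss values according to the active stage."""
    if not stage.use_action:
        return l_dvf                      # Stage 1: visual foresight only
    loss = alpha * l_dvf + l_action       # Stage 2: combined loss from the text
    if stage.use_language:
        loss += beta * l_lang             # Stage 3: assumed additional term
    return loss

# Example with made-up loss values.
stage2_loss = total_loss(STAGES[1], l_dvf=2.0, l_action=1.0, alpha=0.5)
```

Keeping the backbone out of the `trainable` set until the last stage mirrors the stated goal of avoiding cross-modal competition early in training.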

C. Adaptive Temporal Ensemble (ATE)

Standard Temporal Ensemble (TE) improves motion stability but is computationally expensive. Mantis introduces ATE, which dynamically adjusts the ensemble strength:

Target Patches: Regions relevant to the instruction (identified via text-to-vision attention).
Dynamic Patches: Regions with significant visual changes (identified via inter-frame cosine similarity).
Logic: If there is an overlap between target and dynamic patches (indicating fine-grained manipulation like grasping), TE is activated for stability. Otherwise, it is disabled to save inference time.
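
The gating rule above can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: the function names, the top-k selection of target patches, the similarity threshold, and the toy feature vectors are all assumptions introduced here.

```python
import numpy as np

def target_patches(attn: np.ndarray, top_k: int) -> np.ndarray:
    """Mark the top_k patches with the highest text-to-vision attention."""
    mask = np.zeros(attn.shape[0], dtype=bool)
    mask[np.argsort(attn)[-top_k:]] = True
    return mask

def dynamic_patches(prev: np.ndarray, curr: np.ndarray,
                    sim_threshold: float) -> np.ndarray:
    """Mark patches whose inter-frame cosine similarity falls below the threshold."""
    num = (prev * curr).sum(axis=1)
    den = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1) + 1e-8
    return (num / den) < sim_threshold

def use_temporal_ensemble(attn, prev, curr, top_k=1, sim_threshold=0.9) -> bool:
    """Enable TE only when a target patch is also a dynamic patch."""
    overlap = target_patches(attn, top_k) & dynamic_patches(prev, curr, sim_threshold)
    return bool(overlap.any())

# Toy example: 4 patches with 3-dimensional features (made-up values).
attn = np.array([0.1, 0.6, 0.2, 0.1])          # patch 1 is the instruction target
prev = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])

curr_grasp = prev.copy()
curr_grasp[1] = [1., 0., 0.]                   # the target patch changes
curr_transit = prev.copy()
curr_transit[3] = [0., 0., 1.]                 # only a non-target patch changes

te_on_grasp = use_temporal_ensemble(attn, prev, curr_grasp)
te_on_transit = use_temporal_ensemble(attn, prev, curr_transit)
```

In the first case the changing region coincides with the instructed target, so ensembling is switched on; in the second the motion is elsewhere, so it stays off.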

3. Key Contributions

Disentangled Visual Foresight (DVF): A novel architecture that separates visual prediction from action learning using meta-queries and a DiT head, enabling the model to capture latent actions without overburdening the backbone.
Progressive Training Strategy: A staged training recipe that successfully integrates vision, action, and language modalities, preventing catastrophic forgetting of reasoning abilities.
Adaptive Temporal Ensemble (ATE): An inference-time strategy that balances motion stability and computational efficiency, reducing inference counts by ~50% without performance loss.
State-of-the-Art Performance: Mantis achieves superior results on both simulation benchmarks and real-world robot experiments.

4. Experimental Results

Simulation Benchmarks (LIBERO)

Success Rate: Mantis achieved a 96.7% average success rate on the LIBERO benchmark, outperforming strong baselines like OpenVLA (76.5%), $\pi_0$ (94.2%), and UnifiedVLA (95.5%).
Convergence Speed: Mantis demonstrated significantly faster convergence than entangled visual foresight models (e.g., UnifiedVLA), reaching high success rates in fewer epochs. This validates the efficiency of the disentangled design.

Real-World Experiments (Agilex Platform)

Instruction Following: Mantis was tested against $\pi_{0.5}$ (a leading open-source VLA) on in-domain (ID) and out-of-domain (OOD) instructions.
- ID Tasks: Mantis outperformed $\pi_{0.5}$ .
- OOD Tasks: Mantis showed strong generalization (e.g., following the indirect instruction "Put the cup on the female singer" as well as the explicit "Put the cup on Taylor Swift"), whereas $\pi_{0.5}$ failed almost entirely on OOD tasks.
Reasoning: Mantis successfully handled tasks requiring world knowledge (identifying celebrities) and basic arithmetic (calculating numbers), supporting the efficacy of language supervision.

Efficiency (ATE)

The Mantis-ATE variant reduced the number of inference calls by about 50% (e.g., from ~154 to ~77) while maintaining success rates comparable to the standard Temporal Ensemble version.
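
To see why gating the ensemble can lower the inference count, consider a toy counter. It rests on an assumption not spelled out in the text: with TE active the policy is queried at every control step, whereas with TE disabled a predicted action chunk is executed to the end before the next query. The function name, the chunk length, and the gate pattern are hypothetical, and the numbers are not meant to reproduce the reported figures.

```python
def count_inference_calls(gate, chunk_len):
    """Count policy queries over an episode.

    gate[t] is True when temporal ensembling is active at step t.
    chunk_len is the (assumed) number of actions predicted per query.
    """
    calls = 0
    buffered = 0  # actions remaining from the most recent predicted chunk
    for ensemble_on in gate:
        if ensemble_on or buffered == 0:
            calls += 1            # query the policy for a fresh chunk
            buffered = chunk_len
        buffered -= 1             # execute one action
    return calls

steps = 8
always_on = count_inference_calls([True] * steps, chunk_len=4)
always_off = count_inference_calls([False] * steps, chunk_len=4)
# Ensembling only during an assumed fine-manipulation window (steps 4 and 5).
adaptive = count_inference_calls(
    [False, False, False, False, True, True, False, False], chunk_len=4)
```

In this toy setting the adaptive schedule sits between the two extremes, querying densely only during the window flagged as fine-grained manipulation.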

Ablation Studies

DVF Variants: Pre-trained DVF performed best, followed by vanilla DVF. Removing the residual connection (flawed-DVF) or the DVF entirely (no-DVF) significantly degraded performance, confirming the necessity of latent action learning.
Language Supervision: Removing language supervision (Mantis-LU) resulted in a drastic drop in OOD generalization, confirming that language supervision is critical for reasoning and adaptability.

5. Significance

Mantis represents a significant step forward in embodied AI by addressing the "capacity vs. reasoning" dilemma in VLA models.

Scalability: It demonstrates that visual foresight can be effectively integrated without sacrificing the model's ability to understand complex language or reason about the world.
Generalization: The strong performance on OOD instructions suggests that Mantis can adapt to unseen tasks and environments, a crucial requirement for real-world deployment.
Efficiency: The ATE mechanism addresses the computational bottleneck of ensemble methods, making high-performance VLA models more practical for real-time robotic control.

In summary, Mantis provides a robust, efficient, and reasoning-capable framework for robotic manipulation, setting a new benchmark for VLA models in both simulation and the real world.