A Pragmatic VLA Foundation Model

This paper introduces LingBot-VLA, a pragmatic Vision-Language-Action foundation model trained on 20,000 hours of real-world dual-arm robot data. The model demonstrates strong generalization and training efficiency across multiple robot platforms, and the authors release the code, model, and benchmarks to advance the field of robot learning.

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng

Published 2026-02-27

Imagine you want to teach a robot to do chores, like making a sandwich or organizing a messy room. In the past, you had to write specific code for every single movement, which was like teaching a robot to walk by manually moving its legs one step at a time. It was slow, expensive, and the robot couldn't handle anything new.

This paper introduces LingBot-VLA, a new kind of "brain" for robots that changes the game. Think of it not as a robot that follows a script, but as a super-intelligent apprentice that learns by watching humans do things thousands of times.

Here is the breakdown of what they did, using simple analogies:

1. The "Gym" for Robots (The Data)

To make a robot smart, you need to feed it data. Most previous robots were trained on a small diet of maybe a few hundred hours of video.

  • The Analogy: Imagine trying to learn to cook. If you only watched 10 cooking shows, you might know how to boil an egg, but you'd be lost if asked to make a complex soufflé.
  • What they did: The team collected 20,000 hours of real-world video. They used 9 different types of robot arms (some look like human arms, some are industrial) to perform all sorts of tasks: peeling lemons, folding towels, sorting blocks, and even assembling a "Luban lock" (a complex Chinese puzzle).
  • The Result: This is like giving the robot a lifetime of experience in a few weeks. The robot didn't just see a narrow slice of tasks; it saw a bit of everything.

2. The "Brain" Architecture (The Model)

The model is called a VLA (Vision-Language-Action) model.

  • Vision: It has eyes (cameras) to see the world.
  • Language: It understands what you say (e.g., "Make me a sandwich").
  • Action: It knows how to move its arms to do it.
  • The Analogy: Think of the robot's brain as having two best friends working together:
    1. The Scholar: A massive language model that understands the world and instructions (like a very smart librarian).
    2. The Athlete: A specialized module that figures out the physical movements (like a gymnast).
    • Usually, these two struggle to talk to each other. LingBot-VLA uses a special "Mixture-of-Transformers" architecture. Imagine them sitting at the same table, sharing a single notepad, so the Scholar can instantly tell the Athlete, "The bread is slippery, grab it gently," and the Athlete adjusts its grip immediately.
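The "shared notepad" idea can be sketched in code. Below is a toy, single-head version of a Mixture-of-Transformers-style block (the names, sizes, and details here are my own illustrative assumptions, not the paper's implementation): each modality keeps its own projection weights, but attention runs over the joint token sequence, so action tokens can attend directly to language tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TwoExpertAttention:
    """Toy Mixture-of-Transformers block: separate per-modality
    weights (the "experts"), one shared attention over all tokens."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Independent Q/K/V/output weights for each expert.
        self.w = {m: {k: rng.standard_normal((dim, dim)) / np.sqrt(dim)
                      for k in ("q", "k", "v", "o")}
                  for m in ("lang", "act")}

    def __call__(self, lang, act):
        # Each expert projects its own tokens...
        q = np.vstack([lang @ self.w["lang"]["q"], act @ self.w["act"]["q"]])
        k = np.vstack([lang @ self.w["lang"]["k"], act @ self.w["act"]["k"]])
        v = np.vstack([lang @ self.w["lang"]["v"], act @ self.w["act"]["v"]])
        # ...but attention mixes the joint sequence: the "Athlete's"
        # action tokens read the "Scholar's" language tokens directly.
        mixed = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
        n = len(lang)
        return mixed[:n] @ self.w["lang"]["o"], mixed[n:] @ self.w["act"]["o"]
```

A real model would add residual connections, layer norms, vision tokens, and multiple heads, and the action expert would decode continuous motor commands rather than raw features.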

3. The "Speed Run" (Efficiency)

Training these models usually takes forever and costs a fortune in computer power.

  • The Analogy: Training a robot used to be like trying to fill a swimming pool with a garden hose.
  • What they did: They built a new, super-optimized software codebase. It's like upgrading that garden hose to a high-pressure firehose.
  • The Result: They can process data 1.5 to 2.8 times faster than other leading systems. This means researchers can train better robots in less time and for less money.

4. The "Final Exam" (The Results)

They didn't just test this in a video game; they tested it in the real world.

  • The Setup: They put the robot through 100 different real-world tasks (the GM-100 benchmark) on 3 different physical robot bodies.
  • The Competition: They pitted LingBot-VLA against other top-tier robot brains (like π0.5 and GR00T).
  • The Scores:
    • Success Rate: How often the robot actually finished the job.
    • Progress Score: Even if it failed, how far did it get?
  • The Outcome: LingBot-VLA won. It was significantly better at completing tasks, especially when the environment changed (e.g., if the table was messy or an object was in an unusual spot).
  • The "Depth" Trick: They found that adding "depth perception" (knowing exactly how far away things are in 3D space) made the robot even more precise, like giving it 3D glasses instead of just 2D vision.
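The two scores can be made concrete with a small sketch. The scheme below is an illustrative assumption, not the paper's exact GM-100 protocol: success counts only fully completed episodes, while progress gives partial credit for subtasks finished along the way.

```python
def evaluate(episodes):
    """episodes: list of (subtasks_done, subtasks_total) per rollout."""
    # Success Rate: did the robot actually finish the whole job?
    success = sum(done == total for done, total in episodes) / len(episodes)
    # Progress Score: even on failures, how far did it get?
    progress = sum(done / total for done, total in episodes) / len(episodes)
    return success, progress

# One full success, one halfway attempt, one miss:
success, progress = evaluate([(4, 4), (2, 4), (0, 4)])
# success ≈ 0.33, progress = 0.5
```

The gap between the two numbers is the point: a robot that gets 90% of the way through every task scores zero on success rate but high on progress, so reporting both separates "almost works" from "doesn't work at all".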

5. The "Scaling Law" Discovery

One of the most important findings is about how much data is enough.

  • The Analogy: In school, you might think studying for 10 hours is enough to pass a test. But what if studying for 100 hours makes you a genius?
  • The Discovery: They tested the robot with 3,000 hours of data, then 10,000, then 20,000. Performance kept improving at every step and showed no sign of hitting a "ceiling" where the robot stopped learning. This suggests that, at least up to this scale, more data keeps helping, and collecting more of it is worthwhile.
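A common way to check for this kind of "no ceiling" trend is to fit a power law (score ≈ a · hours^b) on log-log axes and see whether the exponent stays positive. The scores below are made up for illustration; they are not the paper's numbers.

```python
import numpy as np

hours = np.array([3_000.0, 10_000.0, 20_000.0])
scores = np.array([0.55, 0.66, 0.72])   # hypothetical success rates

# Linear fit in log-log space: log(score) = b * log(hours) + log(a)
b, log_a = np.polyfit(np.log(hours), np.log(scores), 1)

# b > 0 means performance is still climbing as data grows;
# a clear plateau would push the fitted slope toward zero.
print(f"fitted exponent b = {b:.3f}")
```

With real data one would fit many more points and watch whether the local slope shrinks at the high end, which is the signature of an approaching ceiling.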

Why This Matters

This paper is a big deal because it moves robot learning from "theoretical" to "practical."

  1. It's Open Source: They gave away the code, the model, and the data. It's like giving everyone the recipe and the ingredients so the whole world can cook up better robots.
  2. It's General: The robot isn't just good at one thing; it's a "generalist" that can handle a wide variety of tasks on different hardware.
  3. It's Efficient: They showed you don't need a supercomputer the size of a building to train a smart robot; you just need the right software.

In short: They built a robot apprentice that learned from 20,000 hours of real-world demonstrations, understands both instructions and physical space, and can now do complex chores better than the other leading models it was tested against. And they shared the blueprints with everyone.
