LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models

This paper introduces the LangGap benchmark to expose the critical language understanding deficits in state-of-the-art Vision-Language-Action models, demonstrating that while targeted data augmentation offers partial improvements, current models fundamentally struggle to generalize to linguistically diverse instructions.

Yuchen Hou, Lin Zhao

Published 2026-03-03

Imagine you have a very talented robot chef. If you show it a picture of a kitchen and say, "Make me a sandwich," it does it perfectly 95% of the time. It looks like a genius.

But here's the catch: The robot isn't actually listening to you.

It's more like a person who has memorized a specific dance routine. If you show them the kitchen, they automatically know to grab the bread and butter because that's what they've seen a thousand times before. They don't care if you whisper, "Actually, I want a salad," or scream, "Make me a pizza!" They just keep making the sandwich because the picture hasn't changed.

This is the problem the paper "LangGap" tackles. The authors discovered that today's most advanced robot brains (called Vision-Language-Action models) are "cheating." They rely on what they see (the visual shortcuts) and ignore what they hear (the language instructions).

Here is a breakdown of their discovery and solution, using some everyday analogies:

1. The "Same Scene, Different Orders" Test

To prove the robots were cheating, the authors created a clever test called LangGap.

Imagine a restaurant table with a plate, a bowl, and a cup, with a stove nearby.

  • The Old Way: In previous tests, if the table had a plate, the robot was always told to put the bowl on the plate. The robot just memorized: "Plate = Put Bowl Here."
  • The LangGap Way: The authors kept the table exactly the same but changed the words you gave the robot.
    • Scenario A: "Put the bowl on the plate."
    • Scenario B: "Put the bowl on the stove."
    • Scenario C: "Put the cup on the plate."

If the robot was truly smart, it would listen to the words. If it was just memorizing pictures, it would fail when the words changed.
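The core trick of the benchmark can be sketched in a few lines of code. This is a minimal, hypothetical illustration of the idea (the function and policy names are made up for this post, not taken from the paper's actual code): hold the visual scene fixed, vary only the instruction, and check whether the model's behavior tracks the words.

```python
# Sketch of a LangGap-style probe (hypothetical names; the paper's real
# benchmark and model API are not shown here).
# Key idea: the scene stays FIXED; only the instruction changes.

def langgap_probe(policy, scene, trials):
    """Count how often the policy's behavior tracks the words, not the pixels.

    `policy(scene, instruction)` is assumed to return what it picked up
    and where it placed it, e.g. ("bowl", "plate").
    """
    results = {}
    for instruction, expected in trials:
        picked, placed = policy(scene, instruction)
        results[instruction] = (picked, placed) == expected
    return results

# A policy that "cheats": it ignores the instruction entirely and
# replays the routine it memorized for this scene.
def memorizing_policy(scene, instruction):
    return ("bowl", "plate")  # always the training-time behavior

scene = "table with plate, bowl, cup, stove"
trials = [
    ("Put the bowl on the plate.", ("bowl", "plate")),   # original routine
    ("Put the cup on the plate.",  ("cup", "plate")),    # object changed
    ("Put the bowl on the stove.", ("bowl", "stove")),   # target changed
]

outcomes = langgap_probe(memorizing_policy, scene, trials)
# The memorizer passes only the instruction it was trained on:
assert outcomes["Put the bowl on the plate."] is True
assert outcomes["Put the cup on the plate."] is False
assert outcomes["Put the bowl on the stove."] is False
```

A truly language-conditioned policy would pass all three trials on the same image; a visual memorizer passes only the first.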

2. The Shocking Results: The "Target" Blindness

When they tested the best robot model (called π0.5) on this new game, the results were hilarious and scary:

  • Original Instructions: 95% success. (The robot knows the routine.)
  • Change the Object: "Put the cup on the plate" (cup instead of bowl). The robot got it right only about 30% of the time. (It's confused, and mostly guessing.)
  • Change the Target: "Put the bowl on the stove" (stove instead of plate). The robot got it right 0% of the time.

The Metaphor: It's like a GPS that has memorized the route to "Home." If you tell it to go to "Work," it ignores you and drives to "Home" anyway because the map looks the same. The robot completely ignores where you want to go; it only cares about what it sees.

3. The "Training Diet" Experiment

The authors asked: "Can we fix this by feeding the robot more examples?"

They tried a "diet" of new instructions where the robot had to learn to listen to words, not just pictures.

  • Small Diet: If they taught the robot just one new trick (e.g., "Put the bowl on the stove"), the robot learned it instantly! Success jumped from 0% to 90%.
  • Big Diet: But when they tried to teach the robot many different tricks at once (putting things on 16 different spots, with different objects), the robot got overwhelmed. Its performance crashed back down.

The Analogy: Imagine a student who is great at memorizing one math formula. If you give them one new problem type, they can solve it. But if you give them a whole textbook of new, complex problems at once, they panic and forget everything. The robot's brain is currently too "stubborn" to learn that language matters when there are too many options.
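The two "diets" above can be sketched concretely. This is a hypothetical illustration (object and target names are assumptions for this post, not the paper's actual augmentation code): a small diet is a single novel instruction, while a big diet pairs every object with every one of the 16 target spots.

```python
from itertools import product

# Hypothetical sketch of the two augmentation "diets" described above.
objects = ["bowl", "cup", "pan"]                 # assumed object set
targets = [f"spot_{i}" for i in range(16)]       # "16 different spots"

def make_diet(objects, targets):
    """Enumerate instructions pairing each object with each target spot."""
    return [f"Put the {o} on the {t}." for o, t in product(objects, targets)]

small_diet = ["Put the bowl on the stove."]      # one new trick
big_diet = make_diet(objects, targets)           # many tricks at once

print(len(small_diet), len(big_diet))  # 1 48
```

The paper's finding, in these terms: fine-tuning on something like `small_diet` fixed that one instruction (0% to 90%), but fine-tuning on the full combinatorial `big_diet` overwhelmed the model and performance collapsed.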

4. Why This Matters

The paper concludes that we can't fix this just by building bigger robots or feeding them more of the same data. The current "brain architecture" is fundamentally broken when it comes to listening.

  • The Problem: Robots are "visual memorizers," not "language understanders."
  • The Solution: We need to stop testing robots with easy, repetitive tasks. We need to force them to listen by giving them the same visual scene but different instructions (the LangGap benchmark).
  • The Future: To make robots that truly understand us, we need to change how they are built (the architecture) AND give them better, more diverse training data.

The Bottom Line

Right now, our smartest robots are like actors who have memorized their lines but don't understand the script. If the director changes the scene slightly, the actor freezes or keeps doing the old scene.

LangGap is the tool that finally exposes this flaw. It forces the robots to prove they can actually listen, not just look. And right now, they are failing that test miserably.