Imagine you have a very smart robot assistant. You've trained it by showing it thousands of demonstrations of specific chores, like "pick up the red cup" or "put the cookie in the box." The robot is great at these tasks because it has memorized the visual patterns: Red cup = grab here. Cookie box = move there.
But here's the problem: The robot is too reliant on what it sees and not enough on what you say.
This paper is about a glitch called "Counterfactual Failure." It happens when you give the robot an instruction that doesn't match its training, but the scene looks familiar.
The "Musical Chairs" Analogy
Imagine a game of musical chairs.
- The Scene: There is a table with a Tape dispenser and a Mustard bottle.
- The Training: The robot practiced 1,000 times to "Pick up the Tape." It learned that Tape = Action.
- The Test: You walk in and say, "Please Pick up the Mustard."
What happens without the fix?
The robot looks at the table. It sees the Tape. Its brain screams, "I know this! I've done this a thousand times!" It ignores your voice completely, grabs the Tape, and puts it down. It failed to listen to you because the visual cue (the Tape) was too loud.
What happens with the fix?
The robot pauses. It realizes, "Wait, the user said 'Mustard,' not 'Tape.' Even though the Tape is right there, I need to listen to the instruction." It picks up the Mustard.
The Core Problem: "Vision Shortcuts"
The authors found that current robot brains (Vision-Language-Action models, or VLAs) are lazy. They take "vision shortcuts."
- The Shortcut: "If I see a tape dispenser, I grab the tape. I don't need to read the text."
- The Result: If you ask the robot to do something new with the same objects (like "Pick up the mustard" when the tape is also there), it fails. It defaults to its old habits.
This is dangerous. If a robot ignores your voice because it sees a familiar object, it could be unsafe or just useless in a real home.
The Solution: "Counterfactual Action Guidance" (CAG)
The paper proposes a clever trick called CAG. Think of it as giving the robot a second opinion before it moves.
Imagine the robot has two internal voices:
- The "Habit" Voice (Vision-Only): This voice says, "I see a tape! I should grab the tape! I've done this a million times!"
- The "Instruction" Voice (Language-Conditioned): This voice says, "The human said 'Mustard.' I should grab the mustard."
How CAG works:
Instead of just letting the "Habit" voice win, CAG forces the robot to compare the two.
- It asks: "What would I do if I only looked at the picture?" (The Habit Voice).
- It asks: "What would I do if I only listened to the human?" (The Instruction Voice).
- The Magic: It calculates the difference between the two. If the Habit Voice wants to grab the Tape, but the Instruction Voice wants the Mustard, CAG amplifies the Instruction Voice. It essentially tells the robot: "Ignore the habit. Trust the words."
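The comparison above can be sketched in code. This is a minimal, illustrative sketch in the style of classifier-free guidance, not the paper's exact formula; the two policy functions and their outputs are hypothetical stand-ins for the VLA's action predictions.

```python
import numpy as np

def policy_vision_only(image):
    """The "Habit" voice: an action predicted from the image alone.
    (Hypothetical placeholder: returns a vector pointing at the tape.)"""
    return np.array([1.0, 0.0])

def policy_with_instruction(image, instruction):
    """The "Instruction" voice: an action conditioned on image AND language.
    (Hypothetical placeholder: returns a vector pointing at the mustard.)"""
    return np.array([0.2, 0.9])

def guided_action(image, instruction, w=3.0):
    """Combine the two voices: start from the habit, then push the action
    w times along the correction that the instruction is responsible for.
    With w > 1, the language signal is amplified over the visual habit."""
    a_habit = policy_vision_only(image)
    a_instr = policy_with_instruction(image, instruction)
    return a_habit + w * (a_instr - a_habit)
```

Note that when the two voices agree, `a_instr - a_habit` is near zero and the guidance changes nothing; it only kicks in when vision and language pull in different directions.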
The "LIBERO-CF" Benchmark: The Robot's Final Exam
To prove this was a real problem, the authors created a special test called LIBERO-CF.
Think of this as a trick exam for the robot.
- The Setup: They put familiar objects on a table (like a tape dispenser).
- The Trick: They give the robot instructions it has never seen before, like "Pick up the mustard" (when the mustard was just background noise in training).
- The Result: Without the fix, the robots failed miserably (often less than 10% success). They just grabbed the tape.
- With CAG: The robots suddenly got much better (improving success rates by huge margins). They finally started listening to the instructions instead of just staring at the objects.
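The trick behind such a test case can be sketched as a tiny data structure plus a scoring rule. Everything here is illustrative, not the LIBERO-CF API: the episode fields and object names are made up to show the idea.

```python
def score_episode(instruction_target, grabbed_object):
    """Success only if the robot grabbed what the instruction asked for,
    regardless of which objects were common in training."""
    return grabbed_object == instruction_target

# A counterfactual episode: the scene contains the heavily-trained object
# (the tape dispenser) alongside the newly-instructed one.
episode = {
    "scene": ["tape dispenser", "mustard bottle"],
    "instruction": "pick up the mustard bottle",
    "instruction_target": "mustard bottle",
}

# A vision-shortcut policy defaults to the familiar object and fails:
shortcut_result = score_episode(episode["instruction_target"], "tape dispenser")
# A policy that actually listens to the instruction succeeds:
listening_result = score_episode(episode["instruction_target"], "mustard bottle")
```

The point of the benchmark is exactly this asymmetry: both behaviors look competent in isolation, but only one of them is following the instruction.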
Why This Matters
This isn't just about robots picking up tape. It's about making AI reliable.
- Current AI: "I see a red ball, so I must kick it." (Even if you said "Don't kick it, pick it up.")
- Future AI (with CAG): "I see a red ball, but you said 'pick it up.' I will pick it up."
The authors show that you don't need to rebuild the robot's brain or retrain it for years. You just need to add this "second opinion" step (CAG) when the robot is thinking. It's a simple, plug-and-play fix that makes robots much better at following human orders, even when the world looks exactly like their training data.
In short: The paper teaches robots to stop being "visual bullies" who only do what they've seen before, and start being "good listeners" who actually follow your instructions.