2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

This paper proposes a tri-stage token pruning framework for multi-visual-modal Vision-Language-Action (VLA) models that dynamically captures 2D/3D modality salience differences to achieve up to a 2.55x inference speedup with minimal accuracy loss.

Original authors: Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li, Ziyun Zhang, Hong Gao, Yuchen Huang, Yutong Xu, Guojie Luo, Xiang Chen

Published 2026-04-13

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: The Robot's Overloaded Brain

Imagine a robot that needs to learn how to do chores, like closing a fridge or picking up a banana. To do this, it uses a "Brain" called a VLA Model (Vision-Language-Action).

  • The Old Way (2D Only): The robot used to look at the world like a person looking at a photograph. It saw flat images. This was okay, but it was hard for the robot to judge depth (how far away the fridge handle is).
  • The New Way (2D + 3D): To get better, scientists gave the robot a second pair of eyes that sees in 3D (like a depth map or a point cloud). Now, the robot can see exactly where things are in space.

The Problem: Giving the robot 3D vision is like giving it a superpower, but it comes with a heavy price. The robot now has to process twice as much information (flat photos + 3D data). It's like trying to read a book while someone is shouting a second story in your ear. The robot gets overwhelmed, thinks too slowly, and can't move fast enough to be useful in the real world.

The Solution: The "Tri-Stage" Smart Filter

The authors of this paper realized that existing methods to speed up robots were like using a blunt knife: they trimmed away tokens without asking which modality mattered at that moment. This was risky because the robot might cut off the 3D data it needed to grab a cup, or the 2D data it needed to read a label.

They asked a simple question: "Who is the boss right now? Is the 2D image more important, or is the 3D depth map more important?"

They found that the answer changes depending on what the robot is doing and where it is in time. So, they built a Tri-Stage Token Pruning Framework. Think of this as a highly intelligent bouncer at a club who decides who gets to stay in the VIP room (the robot's brain) and who gets kicked out, based on three different rules.


The Three Stages of the "Bouncer"

Stage 1: The Data Preprocessing (The "Raw Material" Check)

  • The Analogy: Imagine you are a chef preparing a meal. You have a pile of vegetables (2D images) and a pile of spices (3D data).
  • The Insight: The researchers found that for some tasks, the vegetables are the main dish, and the spices are just a garnish. For other tasks, the spices are the main flavor.
  • The Action: The bouncer looks at the raw ingredients. If the 3D data is just "noise" (like extra background clutter), the bouncer throws it away immediately. If the 2D image is blurry but the 3D shape is clear, the bouncer keeps the 3D and tosses the 2D.
  • Result: They set two "thresholds" (rules) to decide: "Keep only 2D," "Keep only 3D," or "Keep both" (a small code sketch of this check follows the list).
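
To make the "raw material check" a little more concrete, here is a minimal Python sketch of a threshold-based modality gate. The salience score (mean token norm) and the threshold values are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_modalities(tokens_2d, tokens_3d, tau_2d=0.3, tau_3d=0.3):
    """Decide which visual modality to keep before the VLA backbone.

    tokens_2d, tokens_3d: (N, D) arrays of 2D patch / 3D point tokens.
    tau_2d, tau_3d: hypothetical salience thresholds (illustrative values).
    """
    # Stand-in salience score: mean token norm per modality (assumption).
    score_2d = np.linalg.norm(tokens_2d, axis=-1).mean()
    score_3d = np.linalg.norm(tokens_3d, axis=-1).mean()

    keep_2d = score_2d >= tau_2d
    keep_3d = score_3d >= tau_3d
    if not (keep_2d or keep_3d):
        keep_2d = keep_3d = True  # never throw away both modalities
    return keep_2d, keep_3d
```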

Stage 2: The Semantic Synthesis (The "Context" Check)

  • The Analogy: Now the robot is looking at a specific scene, like a kitchen counter. The scene has three parts: the Background (the wall), the Robot Arm (the tool), and the Target Object (the banana).
  • The Insight:
    • Background: The wall doesn't need 3D depth. It's just a flat wall. Throw away the 3D data here!
    • Robot Arm: The arm needs to know its own shape in 3D to avoid crashing. Keep the 3D!
    • Target (Banana): To pick up the banana, you need to see its color (2D) and its shape (3D). Keep both!
  • The Action: The bouncer divides the scene into these three zones. In the "Background" zone, they aggressively cut out 90% of the data. In the "Target" zone, they protect everything (see the sketch after this list).
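
Here is a rough Python sketch of what region-aware pruning could look like. The semantic region labels, the per-region keep ratios, and the per-token salience scores are all assumed inputs; the paper's actual segmentation and ratios may differ.

```python
import numpy as np

# Hypothetical per-region keep ratios; illustrative only, not the paper's tuned values.
KEEP_RATIO = {"background": 0.1, "robot_arm": 0.5, "target": 1.0}

def prune_by_region(tokens, scores, region_labels):
    """Keep only the top-scoring tokens inside each semantic region.

    tokens:        (N, D) array of visual tokens (2D or 3D).
    scores:        (N,)   per-token salience scores (how these are computed
                          is an assumption, e.g. attention to the instruction).
    region_labels: (N,)   strings in {"background", "robot_arm", "target"}.
    """
    labels = np.asarray(region_labels)
    keep = np.zeros(len(tokens), dtype=bool)
    for region, ratio in KEEP_RATIO.items():
        idx = np.where(labels == region)[0]
        if len(idx) == 0:
            continue
        k = max(1, int(ratio * len(idx)))
        top = idx[np.argsort(scores[idx])[-k:]]  # highest-salience tokens in this region
        keep[top] = True
    return tokens[keep]
```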

Stage 3: The Action Iteration (The "Time" Check)

  • The Analogy: Imagine the robot is reaching for the banana.
    • Step 1: The arm is far away. It mostly needs to see the general shape (3D).
    • Step 2: The arm is close. It needs to see the texture of the banana peel to know if it's ripe (2D).
    • Step 3: The arm grabs. It needs precise depth again.
  • The Insight: The importance of 2D vs. 3D changes moment by moment. If the robot decides what to keep based only on the current second, it might get "jittery" (switching back and forth too fast).
  • The Action: The bouncer uses a "Sliding Window": it remembers what happened over the last few moments. If the robot was relying on 3D a moment ago, it assumes it probably still needs it, smoothing out the decision so the robot doesn't panic and switch modes too quickly (the sketch after this list shows the idea).
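
A minimal sketch of the sliding-window smoothing, assuming a simple running average of each modality's salience over the last few action steps; the window size and the decision rule are illustrative, not taken from the paper.

```python
from collections import deque

class SalienceSmoother:
    """Smooth per-step 2D/3D salience with a sliding window so the
    modality choice does not flip-flop between action steps."""

    def __init__(self, window=5):
        # window: number of recent steps to average (illustrative default).
        self.history_2d = deque(maxlen=window)
        self.history_3d = deque(maxlen=window)

    def update(self, score_2d, score_3d):
        self.history_2d.append(score_2d)
        self.history_3d.append(score_3d)
        avg_2d = sum(self.history_2d) / len(self.history_2d)
        avg_3d = sum(self.history_3d) / len(self.history_3d)
        # The modality whose smoothed salience dominates gets the larger
        # token budget at this step (a simple rule assumed for illustration).
        return "3d" if avg_3d > avg_2d else "2d"
```

Averaging over a few steps trades a little responsiveness for stability: the modality budget only shifts when the trend persists, which is the "don't panic and switch too quickly" behaviour described above.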

The Magic Result

By using this three-step smart filter, the robot can:

  1. Throw away the junk: It removes up to 70-80% of the unnecessary data.
  2. Keep the gold: It keeps the exact 2D or 3D data needed for the specific moment and specific object.
  3. Run faster: The robot becomes 2.55 times faster.

The Bottom Line:
Before this paper, robots with 3D vision were too slow to be practical. This new method is like giving the robot a pair of sunglasses that automatically tint the right lens (2D or 3D) depending on what it's looking at. The robot is now fast enough to do real-time tasks without losing its ability to see the world clearly.

In short: They taught the robot when to look with its 2D eyes, when to look with its 3D eyes, and when to close one eye to save energy, all without dropping the banana.
