RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models
This paper introduces RetoVLA, a lightweight Vision-Language-Action (VLA) model that improves spatial reasoning and real-world robotic performance. Instead of discarding the register tokens produced by the vision encoder, RetoVLA reuses them to inject global spatial context into the action-planning module, adding no extra parameters.
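The core mechanism can be sketched as follows. This is a minimal illustrative sketch, not RetoVLA's actual architecture: it assumes a ViT-style encoder with learnable register tokens (as in "registers" for Vision Transformers) and a hypothetical action head that cross-attends over those tokens instead of discarding them. All class and parameter names here are invented for illustration.

```python
import torch
import torch.nn as nn

class RegisterViT(nn.Module):
    """Toy ViT-style encoder with learnable register tokens.

    Stand-in for the paper's vision backbone (assumption). The register
    tokens attend alongside patch tokens and would normally be dropped
    after encoding.
    """
    def __init__(self, dim=64, n_registers=4, n_layers=2):
        super().__init__()
        self.n_registers = n_registers
        self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patch_tokens):
        b = patch_tokens.size(0)
        x = torch.cat([self.registers.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        # Split outputs: registers are usually discarded here; RetoVLA
        # instead forwards them to the action module.
        return x[:, self.n_registers:], x[:, :self.n_registers]

class ActionHead(nn.Module):
    """Hypothetical action-planning head that reuses register tokens.

    A learned action query cross-attends over the encoded registers to
    pull in global spatial context, then projects to an action vector
    (7-DoF here as an example). The real fusion scheme may differ.
    """
    def __init__(self, dim=64, action_dim=7):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, action_dim)

    def forward(self, registers):
        q = self.query.expand(registers.size(0), -1, -1)
        ctx, _ = self.attn(q, registers, registers)
        return self.out(ctx.squeeze(1))
```

Because the register tokens already exist inside the encoder, routing them to the action head adds no new parameters beyond the (small) cross-attention the head would need anyway, which matches the paper's "lightweight" claim.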