Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation
This paper introduces Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that addresses the "Precision-Reasoning Gap" in Vision-Language-Action (VLA) models. CGVD parses the language instruction to identify task-irrelevant distractors, then applies Fourier-based inpainting to remove them and produce a clean observation for the policy. This significantly improves robotic manipulation success rates in highly cluttered environments.
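To make the inpainting step concrete, the sketch below shows one simple way a "Fourier-based inpainting" operation could work: masked distractor pixels are iteratively replaced with a low-frequency reconstruction of the rest of the scene. This is an illustrative assumption, not the paper's actual algorithm; the function name `fourier_inpaint`, the mask source, and all parameters are hypothetical, and the distractor mask is assumed to come from the instruction-parsing stage.

```python
import numpy as np

def fourier_inpaint(image, mask, keep_frac=0.05, iters=20):
    """Fill masked (distractor) pixels with a low-frequency Fourier
    reconstruction of the unmasked scene. Illustrative sketch only."""
    filled = image.astype(float).copy()
    filled[mask] = filled[~mask].mean()  # crude initialization
    h, w = image.shape
    # Low-pass filter: keep only the lowest spatial frequencies.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    lowpass = (np.abs(fy) < keep_frac) & (np.abs(fx) < keep_frac)
    for _ in range(iters):
        spectrum = np.fft.fft2(filled)
        smooth = np.fft.ifft2(spectrum * lowpass).real
        filled[mask] = smooth[mask]  # only masked pixels are replaced
    return filled

# Toy usage: a smooth gradient "scene" with a bright square distractor.
scene = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
cluttered = scene.copy()
cluttered[20:30, 20:30] = 5.0           # distractor patch
mask = np.zeros_like(scene, dtype=bool)
mask[20:30, 20:30] = True               # assumed output of instruction parsing
clean = fourier_inpaint(cluttered, mask)
```

After inpainting, the masked region no longer contains the bright distractor but a smooth fill consistent with the surrounding scene, while all unmasked pixels are left untouched, which is the behavior a "clean observation" stage would need before the image is passed to the VLA policy.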