Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
This paper introduces a large-scale framework for Vision-and-Language Navigation that leverages room tour videos from the web and implicit geometry representations to overcome the limitations of simulators. The framework enables robust zero-shot navigation agents that achieve state-of-the-art performance across multiple benchmarks.