Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
This paper demonstrates that frozen vision-language model features contain rich, continuous geometric information: readouts from these features outperform the models' text-based outputs by 3.3x. The accuracy bottleneck therefore stems from training objectives and autoregressive generation rather than from representational limitations, as evidenced by high-precision linear probes and consistent performance across diverse encoder architectures.
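
To make the idea of a linear probe on frozen features concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes the features have already been extracted into a NumPy array, and it uses synthetic random features and a synthetic scalar target in place of real encoder activations and geometric measurements. The names `fit_linear_probe`, `predict`, and `make_synthetic`, the ridge penalty, and the closed-form solver are all illustrative assumptions.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a ridge-regularized linear map from frozen features to a scalar.

    features: (n, d) array of precomputed, frozen encoder features.
    targets:  (n,) array of continuous measurements.
    Returns a weight vector of length d + 1 (last entry is the bias).
    """
    X = _with_bias(features)
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    # Closed-form ridge solution: (X^T X + reg) w = X^T y
    return np.linalg.solve(X.T @ X + reg, X.T @ targets)


def predict(weights, features):
    """Apply a fitted probe to new features."""
    return _with_bias(features) @ weights


def make_synthetic(n=400, d=32, noise=0.05, seed=0):
    """Hypothetical stand-in data: a target that is linear in the features."""
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = feats @ true_w + 1.5 + noise * rng.normal(size=n)
    return feats, y


feats, y = make_synthetic()
train_x, test_x = feats[:300], feats[300:]
train_y, test_y = y[:300], y[300:]

# Only the probe weights are fit; the features themselves are never updated.
w = fit_linear_probe(train_x, train_y)
mae = float(np.mean(np.abs(predict(w, test_x) - test_y)))
print(f"held-out mean absolute error: {mae:.3f}")
```

Because only the probe's weights are fit, a low held-out error indicates that the quantity is linearly recoverable from the features. On real data, the feature array would come from a frozen encoder's activations rather than from `make_synthetic`.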