Imagine you have a brilliant, multilingual librarian named Foundation Model. This librarian has read every book in the world and can describe a picture of a hand in perfect detail. However, if you ask her, "What is the exact angle of this finger?" she stammers. She might say, "It's bent a little bit," or "Maybe 20 degrees?" when the real answer is 6 degrees. She's terrible at giving you the precise numbers.
But here's the twist: The librarian actually knows the answer perfectly. She just doesn't know how to say it.
This paper, titled "Do Foundation Models Know Geometry?", is like a detective story where the authors prove that the librarian's brain (her internal "frozen features") is full of perfect geometric data, but her mouth (the text generator) is the bottleneck.
Here is the breakdown of their discovery using simple analogies:
1. The "Silent Genius" vs. The "Chatty Fool"
The researchers tested 14 different AI models (the librarians) on tasks like measuring hand angles, head poses, and object positions.
- The Text Problem: When they asked the models to state the angles in text, the answers were messy. The best text answer was still off by about 20 degrees. It's like asking a master carpenter to guess the length of a board by eye; they might get close, but they won't be precise.
- The "Silent" Truth: The researchers then bypassed the mouth entirely. They plugged a tiny, simple math tool (a "linear probe") directly into the model's brain. Suddenly, the model gave the answer with an error of only 6 degrees.
- The Analogy: Imagine a person who can solve a complex math equation in their head instantly but can only speak in vague riddles. If you ask them to write it down, they fail. But if you put a pen directly in their hand and let them write without speaking, they get it right. The knowledge was there all along; the speech was the problem.
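The "pen in the hand" trick can be sketched in code. A linear probe is just one linear layer fit on top of frozen features; the big model is never retrained. The feature dimension, sample count, and synthetic data below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, feat_dim = 200, 64  # assumed sizes, for illustration only

# Pretend these are frozen features extracted from a foundation model,
# and that the true angle is linearly recoverable from them.
features = rng.normal(size=(n_samples, feat_dim))
true_weights = rng.normal(size=feat_dim)
angles = features @ true_weights  # ground-truth angles (degrees)

# "Plugging in" the probe = solving one least-squares problem.
# The foundation model's own weights are untouched throughout.
probe_weights, *_ = np.linalg.lstsq(features, angles, rcond=None)
predicted = features @ probe_weights

print(np.abs(predicted - angles).mean())  # tiny error on this toy data
```

In the real paper the features come from a vision backbone and the probe is trained on labeled angles; the point of the sketch is only that the readout itself is trivially small and linear.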
2. The "Translator" Fix (LoRA)
The authors tried to fix the "mouth" problem. They didn't retrain the whole giant brain (which is expensive and slow). Instead, they added a tiny, lightweight adapter called LoRA.
- What happened: This tiny adapter acted like a specialized translator. It taught the model how to route the perfect geometric data from its brain directly to its mouth without losing any detail.
- The Result: The text answers' error dropped from about 20 degrees to just 6.5 degrees. It proved that the model didn't need to learn geometry; it just needed to learn how to access the geometry it already had.
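The LoRA "translator" idea reduces to simple linear algebra: keep a large weight matrix W frozen and learn only a low-rank correction B @ A added alongside it. The dimensions and rank below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 768, 768, 8  # assumed sizes; rank is what keeps it cheap

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable, tiny
B = np.zeros((d_out, rank))               # trainable, starts at zero

def adapted_forward(x):
    # Frozen path plus the low-rank "translator" path.
    return W @ x + B @ (A @ x)

full_params = W.size
lora_params = A.size + B.size
print(lora_params / full_params)  # the adapter is a small fraction of W
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen one; training only has to learn the small correction that routes the geometric signal out.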
3. The "Different Roads to the Same Mountain"
One of the most fascinating findings is about the models themselves. The researchers tested models built in completely different ways:
- Some learned by matching pictures to words (like CLIP).
- Some learned purely from pictures, with no text labels at all (self-supervised, like DINOv2).
- Some used convolutional architectures instead of transformers (like ConvNeXt).
The Discovery: Even though these models look different internally (like a Ferrari, a truck, and a bicycle), they all ended up with the same level of geometric accuracy when you probed them.
- The Analogy: Imagine five different hikers taking five different trails up a mountain. One takes a steep path, one takes a winding road, and one flies a drone. When they reach the summit, they all have the exact same view. The paper calls this "Functional Convergence without Representational Convergence." In plain English: Different brains, different wiring, but they all "see" the shape of the world in the exact same way.
4. The "Spotlight" Effect
The paper also found that where the model looks matters.
- Loose Photos: If you take a photo of a face in a wide room, the model needs to focus specifically on the face patches to get the head angle right. If you remove the face patches, the model gets confused.
- Tight Photos: If you take a photo of a toy car that fills the whole frame, the model doesn't need to focus on one spot; the geometry is everywhere.
- The Lesson: The model's attention is like a spotlight. For some tasks, you need to move the spotlight; for others, the whole stage is lit up.
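The "spotlight" test can be sketched as a patch-masking experiment: zero out the feature patches covering the face and check whether a probe's prediction shifts. The grid size, feature dimension, masked region, and toy probe below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

grid, feat_dim = 14, 32  # assumed 14x14 patch grid (ViT-style)
patches = rng.normal(size=(grid * grid, feat_dim))
probe = rng.normal(size=feat_dim)

def predict(p):
    # Toy readout: average-pool the patch features, then a linear head.
    return p.mean(axis=0) @ probe

baseline = predict(patches)

# Mask a 4x4 block of patches where the "face" supposedly sits.
masked = patches.copy().reshape(grid, grid, feat_dim)
masked[3:7, 3:7, :] = 0.0
masked_pred = predict(masked.reshape(-1, feat_dim))

# A large shift here would suggest those patches carried the geometry.
print(abs(masked_pred - baseline))
```

Running the same masking on an object that fills the frame would, per the paper's finding, shift the prediction far less, because the geometric signal is spread across all patches.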
Why Does This Matter? (The "So What?")
Before this paper, if you wanted an AI to measure hand angles for a robot or a medical app, you had to build a brand-new, expensive, specialized AI just for that job.
This paper says: "Stop building new tools! You already have the tool."
- You can take a giant, pre-trained AI model (which companies already have).
- Add a tiny, cheap "probe" (about 6,000 parameters—tiny compared to the billions in the main model).
- And suddenly, that giant model can measure hands, heads, and objects with high precision.
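The "about 6,000 parameters" claim is easy to sanity-check: a bias-free linear probe mapping a feature vector to a few outputs has (feature dim × output dim) weights. The specific dimensions below are assumptions chosen to land near that figure, not numbers from the paper:

```python
# Back-of-envelope probe size: e.g. a 2048-dim feature vector (assumed)
# mapped to 3 angle outputs (assumed), weight matrix only, no bias.
feat_dim, n_outputs = 2048, 3
probe_params = feat_dim * n_outputs
print(probe_params)  # 6144, i.e. "about 6,000"
```

Compare that with the billions of parameters in the backbone: the probe is roughly a million times smaller, which is what makes this approach essentially free.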
The Bottom Line
Foundation models are geometric geniuses trapped in a text-speaking body. They know the exact angles of your fingers and the position of your head, but they struggle to say it out loud. By using a simple "probe" or a tiny "translator" (LoRA), we can unlock this hidden superpower without needing to retrain the whole system.
It's like realizing your smartphone has a built-in laser level, but you've been trying to use it as a flashlight this whole time. You just needed to flip the switch.