The Big Idea: Teaching a Specialist to Eat Everything
Imagine you have a brilliant art critic named DINO. DINO is famous for looking at color photographs (RGB images) and understanding exactly what is in them. If you show DINO a picture of a cat, it knows it's a cat. If you show it a sunset, it knows it's a sunset. DINO is a "specialist" who only eats one type of food: color photos.
However, the real world isn't just color photos. Sometimes we see things in black and white, sometimes we see depth maps (which look like heat maps showing how far away things are), and sometimes we see segmentation maps (which look like coloring books where every object is a different solid color).
The Problem:
If you show DINO a color photo of a cat and then show it a depth map of the same cat, DINO gets confused. To DINO, the color photo and the depth map look like two completely different, unrelated things. It's as if DINO thinks the depth map is a picture of a toaster, even though it's the same cat.
The researchers asked: Can we teach DINO to understand that a color photo, a depth map, and a segmentation map are all just different ways of describing the same scene?
The Solution: The "Omnivorous" Diet
The team created a new version of DINO called the Omnivorous Vision Encoder. Think of this as teaching DINO to become an omnivore—an eater that can digest many different types of food (modalities) and still recognize the same meal.
Here is how they did it, using three main ingredients:
1. The "Twin" Strategy (Teacher-Student)
They didn't want to retrain DINO from scratch because that would be like forgetting everything DINO already knew. Instead, they created a Student version of DINO.
- The Teacher: The original, frozen DINO (who only knows color photos).
- The Student: A copy of DINO that is allowed to learn new things.
The Student is told: "You must look at the depth map and the color photo, and make sure your brain sees them as the same thing. But, you must also keep your original knowledge about what a cat looks like."
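The teacher-student idea above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's actual training code: the feature dimension, the cosine-based loss, and the variable names are all made-up stand-ins for whatever the real encoder produces.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins: in the real setup these would be the frozen
# Teacher's features for the color photo and the trainable Student's
# features for the depth map of the same scene.
teacher_rgb_feat = rng.standard_normal(768)    # frozen Teacher on the RGB photo
student_depth_feat = rng.standard_normal(768)  # Student on the depth map

# Alignment objective: push the Student's depth features toward the
# Teacher's photo features. Using (1 - cosine similarity) as a simple loss:
# 0 means perfectly aligned, 2 means pointing in opposite directions.
alignment_loss = 1.0 - cosine_sim(student_depth_feat, teacher_rgb_feat)
```

During training, gradient steps would lower `alignment_loss`, pulling the Student's view of the depth map toward the Teacher's view of the photo.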
2. The "Anchoring" Rope
There was a risk. If they just told the Student to match the depth map to the photo, the Student might get lazy. It might decide, "Okay, I'll just turn everything into a blurry gray blob. That way, everything looks the same!" This is called "collapsing" the feature space.
To stop this, they used an Anchoring Loss. Imagine the Student is a kite flying in the wind (trying to align different images). The Teacher is a heavy anchor on the ground.
- The wind tries to pull the kite to match the new images.
- The anchor (the Teacher) pulls back, saying, "Don't forget what a cat actually looks like!"
- The result is a kite that flies high enough to catch new winds (new image types) but stays tethered to the ground (original knowledge).
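The kite-and-anchor balance can be written as a weighted sum of two loss terms. Again, this is only a hedged sketch: the mean-squared-error losses and the weight `lam` are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def feature_loss(a, b):
    """Mean squared distance between two feature vectors (a simple proxy loss)."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(1)
teacher_rgb = rng.standard_normal(768)                          # frozen Teacher on the photo
student_rgb = teacher_rgb + 0.1 * rng.standard_normal(768)      # Student on the same photo
student_depth = rng.standard_normal(768)                        # Student on the depth map

# The "wind": pull the Student's depth features toward its photo features.
alignment = feature_loss(student_depth, student_rgb)

# The "anchor": keep the Student's photo features close to the frozen
# Teacher's, so the Student cannot collapse everything into a gray blob.
anchoring = feature_loss(student_rgb, teacher_rgb)

lam = 0.5  # hypothetical weight trading off new alignment vs. old knowledge
total_loss = alignment + lam * anchoring
```

If the Student drifted into the "blurry gray blob" collapse, its photo features would drift away from the Teacher's, and the anchoring term would grow, pulling it back.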
3. The "Smoothie" Training (Data Augmentation)
This is the most creative part. The researchers realized that if they just showed the Student a pure depth map and a pure photo, the Student might cheat. It might just learn to recognize the colors used in the depth map (e.g., "Oh, blue means far away") rather than the actual shape of the object.
To stop cheating, they created a "Mixed Diet":
- They took a depth map and blended it with a color photo, like making a smoothie.
- They took a segmentation map and mixed it with a color photo.
- They did this randomly, creating a continuous spectrum of images that were half-depth, half-photo, or 80% photo, 20% depth.
Why? This forced the Student to stop looking at superficial things like "is this image blue or red?" and start looking at the structure (the shape of the cat). It taught the model that the shape is the most important thing, regardless of whether it's painted in color, depth, or black-and-white.
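The smoothie blending itself is a one-line pixel-wise mix. The sketch below assumes the depth map has already been rendered as a 3-channel colormap so it can be blended with the photo; image sizes and the uniform sampling of the mixing ratio are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 64x64 inputs: an RGB photo and a depth map rendered as a
# 3-channel colormap, both with values in [0, 1).
rgb = rng.random((64, 64, 3))
depth = rng.random((64, 64, 3))

# Sample a random mixing ratio per training example. This yields a
# continuous spectrum from pure photo (alpha=0) to pure depth (alpha=1),
# including mixes like 80% photo / 20% depth.
alpha = rng.uniform(0.0, 1.0)
mixed = (1.0 - alpha) * rgb + alpha * depth
```

Because `alpha` changes every example, the model can never rely on modality-specific color statistics ("blue means far away") and has to fall back on the shared structure.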
The Results: A Super-Model
After this training, the new "Omnivorous DINO" became a superhero of computer vision:
- Cross-Modal Retrieval: If you search for a scene using a depth map, it can instantly find the matching color photo in a database. Before, it was like searching for a book by its cover color and getting the wrong book. Now, it reliably finds the right book.
- Zero-Shot Transfer: This is the coolest trick. They trained a "depth predictor" (a tool that guesses how far away things are) using only color photos. Then, they fed it a segmentation map (a coloring book style image) it had never seen before.
- Old DINO: "I don't know what this is! I can't guess the depth!"
- Omnivorous DINO: "Oh, I see a cat. I recognize its shape from my photo training, and the shape tells me how far away it is. Here is the depth map."
- It worked because the model learned that the structure of the cat is the same, even if the input "food" was different.
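Cross-modal retrieval boils down to a nearest-neighbor search in the shared feature space. The sketch below simulates this with random unit vectors: the database index, embedding size, and noise level are made up, but the retrieval step (cosine similarity followed by argmax) is the standard recipe.

```python
import numpy as np

rng = np.random.default_rng(3)

def normalize(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical database of 100 color-photo embeddings from the shared encoder.
db = normalize(rng.standard_normal((100, 128)))

# A depth-map query: after omnivorous training, its embedding should land
# near its matching photo's. Simulate that by lightly perturbing entry 42.
query = normalize(db[42] + 0.02 * rng.standard_normal(128))

# Retrieval = nearest neighbor under cosine similarity (dot product of
# unit vectors), i.e. the photo whose embedding best matches the query.
scores = db @ query
best = int(np.argmax(scores))
```

With the original DINO, the depth map's embedding would land nowhere near the photo's, and `best` would be an unrelated image; after omnivorous training, the matching photo wins the argmax.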
The Takeaway
The paper shows that by feeding a powerful AI model a "mixed diet" of different image types (photos, depth, segmentation) and carefully balancing new learning with old knowledge, we can create a universal visual brain.
This brain doesn't care if you show it a photo, a 3D scan, or a sketch. It sees the world consistently, just like a human does, whether they are looking at a scene in daylight, in the dark, or through a pair of glasses. It turns a specialist into a generalist without losing its expertise.