Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Imagine you are trying to describe a complex painting to a friend over the phone.

The Old Way (CLIP): Your friend has a camera that only takes one giant, blurry photo of the whole painting. They can tell you, "It's a landscape with a house," but if you ask, "What color is the door?" or "How many windows are on the roof?", they have to guess because the details are lost in the blur.
The Other Way (DINOv3): Your friend has a super-microscope. They can see every single brushstroke and the texture of the wood. But if you ask, "Is this a house or a barn?" they get confused because they are so focused on the tiny details that they miss the big picture.

Granulon is the new, super-smart assistant that solves this problem. It combines the best of both worlds by acting like a chameleon camera that can instantly change its zoom level based on what you ask.

Here is a simple breakdown of how it works:

1. The Problem: The "Zoom" Dilemma

Current AI models are stuck in a rut.

Some models are great at understanding the big picture (global semantics) but terrible at spotting small details.
Others are amazing at seeing tiny details (pixel-level) but struggle to understand the overall story or context.
Trying to use both types of cameras at once is slow and expensive, like hiring two photographers to take the same photo.

2. The Solution: The "Granulon" Camera

The researchers built a new AI called Granulon. Instead of forcing the AI to choose between "zoomed out" and "zoomed in," Granulon has a smart remote control that changes the zoom level instantly, depending on your question.

The Two Magic Parts:

A. The "Question Detective" (The Controller)
Think of this as a smart assistant who listens to your question first.

If you ask, "What kind of animal is in the picture?" (A big-picture question), the Detective tells the camera to zoom out to see the whole scene.
If you ask, "What color is the dog's ear?" (A tiny detail question), the Detective tells the camera to zoom in tight to see the fur texture.
It decides the perfect "level of detail" before the AI even starts looking.

B. The "Smart Summarizer" (AdaTA)
Once the camera zooms to the right level, this part cleans up the information.

Imagine looking at a forest. If you zoom out, you don't need to see every single leaf; you just need to see the "tree." If you zoom in, you need to see the "leaf."
The Smart Summarizer groups similar pixels together into neat, compact "tokens" (little chunks of information). It throws away the noise and keeps only the most important details, ensuring the AI doesn't get overwhelmed by too much data.

3. Why This Matters: Less "Hallucination"

One of the biggest problems with AI today is hallucination: making things up.

If an AI is too focused on the "big picture," it might guess, "There's a cat on the roof," just because it sees a shape that looks like a cat.
If it's too focused on details, it might get lost in the weeds and forget the context.

Granulon addresses this by matching the detail level to the question. Because it sees the right amount of detail for each question, it is much less likely to make things up. The paper reports about 20% fewer of these "made-up" answers and roughly 30% higher accuracy than previous models.

The Analogy: The Detective and the Magnifying Glass

Imagine a detective solving a crime.

Old AI: The detective either looks at the crime scene from a helicopter (missing the clues on the floor) or looks through a microscope at a single speck of dust (missing the whole room).
Granulon: The detective has a magic magnifying glass.
- When the detective needs to know who was in the room, the glass zooms out to show the whole room.
- When the detective needs to know what was written on a note, the glass zooms in to read the handwriting.
- The detective switches between these views instantly, based on the specific clue they are looking for.

The Bottom Line

Granulon teaches AI to be flexible. It stops forcing the computer to be either a "big picture thinker" or a "detail-oriented worker." Instead, it lets the AI be both, switching gears instantly to give you the most accurate, truthful, and detailed answer possible. It's like giving the AI a pair of glasses that automatically adjust their prescription to whatever you are looking at.

Here is a detailed technical summary of the paper "Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM".

1. Problem Statement

Multimodal Large Language Models (MLLMs) currently face a fundamental trade-off in visual encoding:

CLIP-based Encoders: These models excel at global semantic alignment and cross-modal understanding but struggle with fine-grained visual details (e.g., textures, specific object attributes). They tend to produce coarse representations that lead to information loss in pixel-level reasoning.
Pixel-Level Encoders (e.g., DINOv3): These self-supervised models possess exceptional pixel-level perception and relational structure understanding. However, they lack coarse-grained semantic abstraction, making them less effective at high-level reasoning or tasks requiring global context.
The Gap: Existing solutions often rely on multi-encoder architectures (combining CLIP and DINO), which are computationally expensive and fail to unify coarse-to-fine reasoning within a single encoder. There is a need for a mechanism that allows a single pixel-level encoder to dynamically adapt its abstraction level based on the input query.

2. Methodology: Granulon

The authors propose Granulon, a novel MLLM architecture built upon the DINOv3 visual encoder. Instead of relying on fixed global features, Granulon introduces adaptive granularity augmentation to unlock the latent semantic capacity of pixel-level encoders. The architecture consists of two core modules:

A. Text-Conditioned Granularity Controller

Function: This module dynamically predicts the optimal visual abstraction level (granularity) based on the linguistic complexity and referential scope of the textual input (the question).
Mechanism:
- It takes the text embeddings ($T_e$) from the first layer of the LLM.
- It aggregates these embeddings via a pooling and projection layer ($\Psi_{agg}$) to create a compact textual descriptor.
- An MLP head ($\Phi_{MLP}$) maps this descriptor to a probability distribution over a predefined set of granularity hypotheses ($\pi_\Theta = \{g_k\}$).
- Each hypothesis $g_k$ defines parameters for spatial down-sampling ($\alpha_k$) and token cluster cardinality ($\beta_k$).
Outcome: The model learns to select "coarse" settings for global questions (e.g., "What is the scene?") and "fine" settings for detail-oriented questions (e.g., "What color is the dog's ear?").
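
The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the hypothesis values in `HYPOTHESES`, the layer sizes, the mean pooling, the `tanh` and ReLU activations, and every function name are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

# Hypothetical granularity hypotheses g_k = (alpha_k, beta_k):
# alpha_k = pooling kernel size, beta_k = number of clusters.
HYPOTHESES = [(1, 64), (2, 32), (4, 8)]  # illustrative values only


def init_controller(d_text, d_hidden, n_hyp, seed=0):
    """Randomly initialise the projection and MLP weights (untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W_agg": rng.normal(0, 0.02, (d_text, d_hidden)),
        "W1": rng.normal(0, 0.02, (d_hidden, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0, 0.02, (d_hidden, n_hyp)),
        "b2": np.zeros(n_hyp),
    }


def granularity_controller(text_emb, params):
    """Map text embeddings (n_tokens, d_text) to probabilities over hypotheses."""
    pooled = text_emb.mean(axis=0)                 # pooling over tokens
    desc = np.tanh(pooled @ params["W_agg"])       # projection (role of Psi_agg)
    hidden = np.maximum(desc @ params["W1"] + params["b1"], 0.0)  # ReLU layer
    logits = hidden @ params["W2"] + params["b2"]  # MLP head (role of Phi_MLP)
    logits = logits - logits.max()                 # numerical stability
    return np.exp(logits) / np.exp(logits).sum()   # softmax


def select_granularity(probs, hypotheses=HYPOTHESES):
    """Pick the most probable hypothesis and return (alpha*, beta*)."""
    return hypotheses[int(np.argmax(probs))]


# Usage with random stand-in embeddings for a 5-token question.
params = init_controller(d_text=16, d_hidden=8, n_hyp=len(HYPOTHESES))
T_e = np.random.default_rng(1).normal(size=(5, 16))
probs = granularity_controller(T_e, params)
alpha_star, beta_star = select_granularity(probs)
```

Taking the argmax is one simple way to turn the distribution into a single setting; the summary does not say whether the paper selects hard or soft, so that choice is an assumption here.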

B. Adaptive Token Aggregation (AdaTA)

Once the granularity parameters are determined, the AdaTA module processes the visual features from DINOv3 to generate semantic tokens. It operates in three stages:

Granularity-Guided Pooling: The visual feature map is spatially down-sampled using a kernel size determined by the Controller's $\alpha^*$. This aligns the token resolution with the desired abstraction level.
Feature Clustering: A mini-k-means process groups the pooled features into clusters. The number of clusters is controlled by $\beta^*$. Crucially, the clustering considers both spatial proximity and attention patterns (relational coherence) to ensure each cluster represents a semantically consistent region.
Feature Refinement & Selection: A quality score is computed for each cluster based on spatial support, semantic homogeneity, and dispersion. The top-$K$ clusters are selected to form the final semantic tokens.
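
The three stages can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses plain k-means instead of the mini-k-means variant, models spatial proximity by appending normalised coordinates to each feature, omits attention patterns entirely, and uses a made-up quality score (`support / (1 + dispersion)`). All function names and the `spatial_weight` parameter are hypothetical.

```python
import numpy as np


def avg_pool(feat, k):
    """Average-pool an (H, W, D) feature map with a k x k kernel and stride k."""
    H, W, D = feat.shape
    H2, W2 = H // k, W // k
    feat = feat[: H2 * k, : W2 * k]                # drop any ragged border
    return feat.reshape(H2, k, W2, k, D).mean(axis=(1, 3))


def kmeans(x, n_clusters, n_iter=10, seed=0):
    """Plain k-means on the rows of x; returns (labels, centroids)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centroids[None]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members):                       # keep old centroid if empty
                centroids[c] = members.mean(axis=0)
    dists = ((x[:, None, :] - centroids[None]) ** 2).sum(-1)
    return dists.argmin(axis=1), centroids         # labels match final centroids


def adata(feat, alpha, beta, top_k, spatial_weight=0.5):
    """Pool with kernel alpha, cluster into beta groups, keep the top_k clusters."""
    pooled = avg_pool(feat, alpha)                 # stage 1: pooling
    H, W, D = pooled.shape
    tokens = pooled.reshape(H * W, D)
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack(
        [rows.ravel() / max(H - 1, 1), cols.ravel() / max(W - 1, 1)], axis=1
    )
    x = np.concatenate([tokens, spatial_weight * coords], axis=1)
    n_clusters = min(beta, len(x))
    labels, _ = kmeans(x, n_clusters)              # stage 2: clustering
    scored = []
    for c in range(n_clusters):                    # stage 3: score and select
        members = tokens[labels == c]
        if len(members) == 0:
            continue
        centroid = members.mean(axis=0)
        support = len(members) / len(tokens)
        dispersion = ((members - centroid) ** 2).sum(axis=1).mean()
        scored.append((support / (1.0 + dispersion), centroid))
    scored.sort(key=lambda s: s[0], reverse=True)
    return np.stack([centroid for _, centroid in scored[:top_k]])


# Usage on a random stand-in for an 8 x 8 DINOv3 feature map with 4 channels.
feat = np.random.default_rng(0).normal(size=(8, 8, 4))
semantic_tokens = adata(feat, alpha=2, beta=6, top_k=3)
```

A larger `alpha` with a smaller `beta` yields fewer, coarser tokens, while `alpha=1` with a large `beta` keeps the representation close to the original pixel-level grid.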

C. Training Objective

Granulon is trained to maximize the joint likelihood of two complementary token streams:

Pixel-level contribution: Ensuring the model retains fine-grained details.
Granularity contribution: Ensuring the aggregated semantic tokens provide robust global understanding.
The loss function combines the standard task loss with regularizers that encourage informative pixel and semantic tokens, balanced by coefficients $\lambda_d$ and $\lambda_t$.
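
One plausible way to write such an objective is shown below. The summary does not give the exact formula, so this is only a sketch of its general shape: $\mathcal{L}_{\text{task}}$, $\mathcal{R}_{\text{pixel}}$, and $\mathcal{R}_{\text{token}}$ are placeholder symbols introduced here, and pairing $\lambda_d$ with the pixel-level term and $\lambda_t$ with the semantic-token term is an assumption.

$$
\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_d \, \mathcal{R}_{\text{pixel}} + \lambda_t \, \mathcal{R}_{\text{token}}
$$

Under this reading, setting $\lambda_d = \lambda_t = 0$ recovers training on the task loss alone, and larger coefficients put more weight on keeping each token stream informative.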

3. Key Contributions

New Paradigm for MLLMs: The paper identifies a new direction of enhancing pixel-level encoders (like DINOv3) with coarse-grained abstraction capabilities, rather than relying solely on CLIP-based semantic encoders.
Granulon Architecture: Introduction of a unified framework that integrates a Text-Conditioned Controller and Adaptive Token Aggregation (AdaTA) to dynamically adjust visual abstraction from "pixel" to "fine" to "coarse" in a single forward pass.
Performance Gains: Demonstrates that adaptive granularity significantly improves reasoning accuracy and reduces hallucinations compared to static encoders.
Interpretability: Provides analysis showing that Granulon achieves better layer-wise alignment with the LLM than the baselines, which supports attributing the improvement to the adaptive granularity mechanism rather than to model bias.

4. Experimental Results

The authors evaluated Granulon across diverse benchmarks (VQA, Image Captioning, Reasoning, and Medical Domain) using identical settings for fair comparison against CLIP, SigLIP, DINOv2, and DINOv3 baselines.

Accuracy Improvements:
- VQA (SEED-Bench & A-OKVQA): Granulon is reported to improve accuracy by ~30% compared to CLIP baselines. In the one example quoted, 58.8% vs 50.91% on SEED-Bench with Qwen2.5, the gain is 7.89 points, or about 15% relative.
- Reasoning (FLUX-Reason): Achieved a GPT-score of 56.67% (with Llama3), outperforming DINOv2 (19.49%) and CLIP (28.97%) by significant margins.
- Medical Domain (SurgVLM): Showed superior performance in phase and instrument recognition, reaching BERT scores of 97.32% and 97.95%, respectively.
Hallucination Reduction:
- Granulon reduced hallucination rates by ~20% compared to baselines.
- In reasoning tasks with Llama3, the hallucination rate dropped from 61.3% (DINOv3) to 46.3% (Granulon), a 15.0-point drop, or a relative reduction of about 24%.
Efficiency: The adaptive controller introduces a slight increase in token consumption (~10%), but the reported performance gain (up to 39.7% improvement over vanilla DINOv3) outweighs that cost, suggesting that text-adaptive selection matters more than raw token quantity.
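
The figures above mix absolute percentage-point differences with relative changes, which are easy to confuse. The small helper below recomputes both from the numbers quoted in this section; the function names are illustrative and not taken from the paper.

```python
def absolute_change(before, after):
    """Difference in percentage points."""
    return after - before


def relative_change(before, after):
    """Change as a percentage of the starting value."""
    return (after - before) / before * 100.0


# Hallucination rate with Llama3: DINOv3 61.3% -> Granulon 46.3%.
halluc_abs = absolute_change(61.3, 46.3)
halluc_rel = relative_change(61.3, 46.3)

# SEED-Bench accuracy with Qwen2.5: CLIP 50.91% -> Granulon 58.8%.
seed_abs = absolute_change(50.91, 58.8)
seed_rel = relative_change(50.91, 58.8)

print(f"hallucination: {halluc_abs:.1f} points, {halluc_rel:.1f}% relative")
print(f"SEED-Bench:    {seed_abs:+.2f} points, {seed_rel:+.1f}% relative")
```

On the quoted values, the Llama3 hallucination drop works out to 15.0 points, about 24.5% relative, and the SEED-Bench example to 7.89 points, about 15.5% relative.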

5. Significance

Unifying Perception and Reasoning: Granulon successfully bridges the gap between low-level pixel perception and high-level semantic reasoning within a single encoder, eliminating the need for computationally heavy multi-encoder setups.
Mitigating Hallucinations: By grounding the LLM in adaptive, multi-scale visual evidence, the model is less prone to "over-activating" internal semantic priors that lead to hallucinations.
Future Direction: The work suggests that the future of MLLMs lies not just in larger models, but in dynamic, task-adaptive visual representations that can flexibly shift between detail-oriented and abstract reasoning modes based on user intent.

In conclusion, Granulon demonstrates that pixel-level encoders, when equipped with adaptive semantic control, can outperform established semantic encoders (like CLIP) in both accuracy and reliability, offering a promising path forward for robust multimodal reasoning.