Imagine you have a high-resolution, crystal-clear photograph of a bustling city. Now, imagine you have a famous art critic (a powerful AI model) who can look at that photo and tell you exactly what every single object is, where the edges are, and how deep the scene goes.
But here's the catch: The critic only looks at a tiny, blurry thumbnail of the photo. They give you a detailed report based on that small, fuzzy image. If you try to use that report to paint a masterpiece on a giant canvas, the details are all wrong. The buildings look like blobs, and the depth feels flat.
This is the problem computer vision scientists face every day. Modern AI models are brilliant, but they often process images in "chunks" (like a grid of low-resolution tiles) to save computing power. When we need to use their knowledge for tasks like self-driving cars or 3D mapping, we need those "chunky" reports to be stretched out to match the full, high-resolution image.
The Old Way: The "One-Size-Fits-None" Tailor
Previously, if you wanted to stretch these blurry reports back to high definition, you had to hire a specific tailor for every single type of critic.
- If your critic was DINO, you needed a "DINO-stretcher."
- If your critic was CLIP, you needed a "CLIP-stretcher."
- If you got a brand new, super-smart critic tomorrow, your old stretchers wouldn't work. You'd have to fire them all and hire new ones, which is expensive and slow.
It's like having a suit that fits perfectly only if you are exactly 5'9" and weigh 160 lbs. If you change even an inch, the suit rips.
The New Way: AnyUp (The "Universal Translator")
In this paper, the authors introduce AnyUp, a universal feature upsampler. Think of it as a magical, shape-shifting tailor who can take a blurry report from any critic, in any format, and instantly stretch it to match your high-resolution photo perfectly.
Here is how they did it, using some simple analogies:
1. The "Feature-Agnostic" Layer (The Universal Adapter)
Imagine you have a pile of different colored Lego bricks (the features from different AI models). Some are big, some are small, some are red, some are blue.
- Old methods tried to sort the bricks by color first, which meant they only worked for one specific color.
- AnyUp uses a special "Universal Adapter." It doesn't care what color the brick is. It looks at the shape and the structure of the pile. It says, "I don't need to know if this is a 'DINO' brick or a 'CLIP' brick; I just need to know how to arrange these shapes to build a clear picture." This allows it to work with any AI model out of the box.
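For the curious, here is a toy sketch of the "don't look at the color, look at the structure" idea in code. This is an illustration of the principle, not the paper's actual layer: it turns a feature map with any number of channels into a fixed-size description based only on how neighboring feature vectors relate to each other.

```python
import numpy as np

def structure_descriptor(feats, eps=1e-8):
    """Turn features with ANY channel count C into a channel-count-independent
    description: the cosine similarity of each location to its 4 spatial
    neighbors. The output always has 4 channels, no matter what C was.
    feats: (C, H, W) array from any backbone (DINO-like, CLIP-like, ...)."""
    C, H, W = feats.shape
    # L2-normalize each feature vector so only its direction matters
    norm = np.linalg.norm(feats, axis=0, keepdims=True) + eps
    f = feats / norm
    # Pad spatially so every location has all 4 neighbors
    fp = np.pad(f, ((0, 0), (1, 1), (1, 1)), mode="edge")
    shifts = [(0, 1), (2, 1), (1, 0), (1, 2)]  # up, down, left, right
    sims = [np.sum(f * fp[:, dy:dy + H, dx:dx + W], axis=0) for dy, dx in shifts]
    return np.stack(sims)  # shape (4, H, W), independent of C

# The same code handles a 384-channel pile of bricks and a 512-channel one:
d1 = structure_descriptor(np.random.randn(384, 8, 8))
d2 = structure_descriptor(np.random.randn(512, 8, 8))
assert d1.shape == d2.shape == (4, 8, 8)
```

Because the descriptor never depends on which channel means what, nothing breaks when a brand new critic with a different feature size shows up.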
2. Window Attention (The "Spotlight" Strategy)
When you try to stretch a blurry image, a common mistake is to look at the entire image to decide what a single pixel should be. This causes "ghosting" or blurring because the AI gets confused by distant, unrelated parts of the picture.
- AnyUp uses a Spotlight. Instead of looking at the whole city, it puts a small window over one neighborhood. It asks, "What is happening right here?" It looks at the immediate neighbors to decide how to stretch the details. This keeps the edges sharp and prevents the "blurry halo" effect seen in older methods.
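Here is a toy version of the spotlight in code. It is a simplification, not the paper's architecture: each high-resolution pixel attends only to a small k x k window of low-resolution neighbors, and the queries are plain nearest-neighbor features rather than the learned, image-guided queries a real model would use.

```python
import numpy as np

def window_upsample(lowres, scale=2, k=3):
    """Upsample (C, h, w) features by `scale`, letting each output pixel
    attend ONLY to a local k x k window of low-res neighbors (the spotlight).
    Toy sketch: queries are nearest-neighbor features, not learned ones."""
    C, h, w = lowres.shape
    H, W = h * scale, w * scale
    r = k // 2
    padded = np.pad(lowres, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((C, H, W))
    for y in range(H):
        for x in range(W):
            cy, cx = y // scale, x // scale            # matching low-res cell
            window = padded[:, cy:cy + k, cx:cx + k].reshape(C, -1)  # (C, k*k)
            q = lowres[:, cy, cx]                      # toy query vector
            logits = q @ window / np.sqrt(C)           # similarity to neighbors
            attn = np.exp(logits - logits.max())
            attn /= attn.sum()                         # softmax over the window
            out[:, y, x] = window @ attn               # local weighted average
    return out

hi = window_upsample(np.random.randn(16, 4, 4), scale=2, k=3)
assert hi.shape == (16, 8, 8)
```

Note what the spotlight buys you: a pixel on one side of the city can never borrow details from an unrelated neighborhood on the other side, which is exactly what causes the "blurry halo" in global methods.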
3. The "Crop" Training (The Puzzle Piece Teacher)
Training a model on whole 4K images is like trying to teach a student to solve a 10,000-piece puzzle all at once. It's too memory-hungry and slow.
- AnyUp uses a clever trick: It only shows the student small pieces of the puzzle (crops) at a time. It teaches the model to fix a small corner of the image perfectly. Because the rules of how to fix a corner are the same as how to fix the whole image, the model learns the skill quickly and efficiently, without needing a supercomputer.
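The puzzle-piece trick is easy to sketch in code. The function below is an illustrative example (the crop size, patch size, and shapes are assumptions, not the paper's settings): it samples a small image crop together with the matching slice of the low-resolution feature grid, giving a cheap, perfectly aligned training pair.

```python
import numpy as np

def random_crop_pair(image, feats, crop=64, patch=16, rng=np.random):
    """Sample an aligned (image crop, feature crop) training pair.
    `image` is (3, H, W); `feats` is (C, H // patch, W // patch).
    The crop boundary is snapped to the patch grid so the two stay aligned."""
    _, H, W = image.shape
    fy = rng.randint(0, (H - crop) // patch + 1)   # crop origin, in patch units
    fx = rng.randint(0, (W - crop) // patch + 1)
    y, x = fy * patch, fx * patch                  # same origin, in pixels
    img_crop = image[:, y:y + crop, x:x + crop]
    feat_crop = feats[:, fy:fy + crop // patch, fx:fx + crop // patch]
    return img_crop, feat_crop

img = np.random.randn(3, 224, 224)
feats = np.random.randn(64, 14, 14)   # one feature vector per 16x16 patch
ic, fc = random_crop_pair(img, feats)
assert ic.shape == (3, 64, 64) and fc.shape == (64, 4, 4)
```

Because the stretching rule is local (thanks to the spotlight), a rule learned on a 64-pixel corner transfers directly to the full-size image at test time.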
Why Does This Matter?
- It's "Plug-and-Play": You can train this model once, and then use it with any future AI vision model. You don't need to retrain it every time a new, smarter AI comes out.
- It's Sharper: The results are much crisper. If you are trying to detect the edge of a car for a self-driving robot, AnyUp gives you a clean line, whereas older methods might give you a fuzzy cloud.
- It's Efficient: It doesn't need a massive supercomputer to run. It's lightweight and fast.
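The plug-and-play point is worth seeing concretely. The snippet below is a hypothetical stand-in, not AnyUp's released API: because the upsampler never assumes a channel count, the same trained instance can serve features from different backbones.

```python
import numpy as np

class AnyStyleUpsampler:
    """Illustrative stand-in for a feature-agnostic upsampler (the real AnyUp
    API will differ). Since it never hard-codes a channel count, ONE instance
    can serve features from different backbones."""
    def __init__(self, scale=4):
        self.scale = scale

    def __call__(self, feats):
        # Nearest-neighbor repeat stands in for the learned upsampling step
        return feats.repeat(self.scale, axis=1).repeat(self.scale, axis=2)

up = AnyStyleUpsampler(scale=4)           # "train once"...
dino_like = np.random.randn(384, 16, 16)  # ...then use with any model:
clip_like = np.random.randn(512, 16, 16)
assert up(dino_like).shape == (384, 64, 64)
assert up(clip_like).shape == (512, 64, 64)
```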
The Bottom Line
AnyUp is like a universal translator for the visual world. It takes the "rough drafts" produced by powerful AI brains and instantly turns them into high-definition, pixel-perfect instructions that robots, cameras, and augmented reality glasses can actually use. It breaks down the barrier between "low-res thinking" and "high-res seeing," making advanced AI accessible to a much wider range of applications.