Local-Global Prompt Learning via Sparse Optimal Transport

Imagine you have a super-smart librarian (the AI model) who has read millions of books and seen millions of pictures. This librarian is great at matching a picture of a "dog" to the word "dog." However, if you ask them to distinguish between a specific breed of dog, like a "Golden Retriever" vs. a "Labrador," just looking at the whole picture isn't enough. They need to zoom in on the details: the shape of the ears, the texture of the fur, or the color of the nose.

This paper introduces a new way to teach this librarian how to spot those tiny, crucial details without getting confused. The method is called SOT-GLP.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Crowded Room" Issue

Previous methods tried to teach the librarian by giving them a list of "local clues" (like "look at the ears" or "look at the tail"). But there was a flaw: every clue was shouting at the same time, all pointing to the same part of the picture.

The Analogy: Imagine a room full of detectives trying to solve a crime. If every detective points at the same suspect (the most obvious part of the image), they miss the other important clues. They all crowd the same spot, and the unique details get lost in the noise. This is called "prompt overlap."

2. The Solution: The "Specialized Team" Approach

The authors created a system where the AI has two distinct teams working together:

The General Manager (Global Branch): This team looks at the whole image to get the big picture. "This is definitely a bird." They handle the general category.
The Specialized Detectives (Local Branch): This team zooms in on specific parts. But instead of everyone fighting for the same spot, they are assigned specific zones.
- Detective A looks at the beak.
- Detective B looks at the wings.
- Detective C looks at the feet.

3. The Magic Trick: "Fair Seating" (Optimal Transport)

How do you make sure the detectives don't all sit in the same chair? The paper uses a mathematical concept called Optimal Transport.

The Analogy: Think of it like a fair seating chart at a wedding. You have a set of important guests (the visual patches of the image) and a set of tables (the different clues/prompts).
Instead of letting every guest crowd around the VIP table, the system uses a "fairness algorithm" that spreads the guests evenly across the tables. Each clue therefore gets its own share of the image to study, with little overlap with the others. This prevents the "crowded room" problem and pushes the AI to learn diverse details.

4. The "Saliency Filter": Ignoring the Noise

Before the detectives start looking, the system puts on a pair of special glasses that blur out the background.

The Analogy: If you are looking for a specific bird in a tree, you don't want to waste time studying the green leaves or the blue sky. The system automatically filters out the "boring" background and only hands the detectives the "interesting" parts (the bird's feathers, beak, etc.).

5. The Big Discovery: Accuracy vs. Safety

The authors found something surprising about how they tune this system.

The "Sharp" Mode (High Accuracy): If you tweak the system to be super-optimized for the specific training data, it becomes incredibly accurate at identifying known items (like distinguishing 100 different types of flowers).
The "Safe" Mode (Robustness): If you don't tweak it as much (removing a specific "projection" layer), the system stays closer to its original, natural state.
- The Result: The "Safe" mode is slightly less perfect at identifying known flowers, but it is much better at spotting things it has never seen before (like a picture of a toaster when it's only trained on animals). It says "I don't recognize this" much more reliably than the other methods.

Why This Matters

In the real world, AI doesn't just need to be smart; it needs to be honest.

Old AI: Might confidently guess that a toaster is a "cat" because it's trying too hard to fit the picture into a category it knows.
SOT-GLP: Can say, "I see a cat-like shape, but the texture and parts don't match any cat I know. This is probably something else."

Summary

SOT-GLP is like hiring a team of specialized detectives who are forced to look at different parts of a crime scene without stepping on each other's toes. By using a "fair seating" system (Optimal Transport) to divide the work, the AI becomes better at spotting fine details. Plus, the researchers discovered that by keeping the system slightly "raw" (not over-tuned), it becomes a much better "bodyguard," more reliably flagging inputs that look like nothing it was trained on.

Here is a detailed technical summary of the paper "Local-Global Prompt Learning via Sparse Optimal Transport" (SOT-GLP).

1. Problem Statement

Vision-Language Models (VLMs) like CLIP excel at zero-shot and few-shot learning by matching global image embeddings to text prompts. However, existing prompt learning methods face two critical limitations:

Loss of Fine-Grained Detail: Most approaches rely on a single global image embedding (e.g., the [CLS] token), which pools spatial information into one vector and discards crucial local features (textures, object parts, spatial configurations) needed to distinguish similar categories or detect Out-of-Distribution (OOD) samples.
Redundancy and Overlap in Local Alignment: Recent methods attempting to use local features often select image patches independently for each text prompt. This leads to prompt overlap, where multiple prompts attend to the same dominant regions, causing redundant feature usage and preventing effective specialization on distinct visual cues.
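
The overlap problem can be reproduced on toy data. The sketch below is not from the paper: it assumes a hypothetical similarity matrix in which one shared "objectness" term dominates every prompt's patch scores, then measures how much the independently selected top-k patch sets coincide. All names (`topk_per_prompt`, `mean_pairwise_overlap`) and the toy numbers are illustrative assumptions.

```python
import numpy as np

def topk_per_prompt(sim, k):
    """sim: (num_patches, num_prompts). Pick the top-k patches for each prompt independently."""
    order = np.argsort(-sim, axis=0)[:k]  # (k, num_prompts), best patches first
    return [set(order[:, j].tolist()) for j in range(sim.shape[1])]

def mean_pairwise_overlap(index_sets):
    """Mean Jaccard overlap between every pair of selected patch sets."""
    scores = []
    for i in range(len(index_sets)):
        for j in range(i + 1, len(index_sets)):
            a, b = index_sets[i], index_sets[j]
            scores.append(len(a & b) / len(a | b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
num_patches, num_prompts, k = 196, 4, 10

# Toy assumption: a dominant region is attractive to every prompt.
objectness = rng.normal(size=(num_patches, 1))
sim_shared = 3.0 * objectness + 0.1 * rng.normal(size=(num_patches, num_prompts))

# Control: prompts with unrelated preferences.
sim_independent = rng.normal(size=(num_patches, num_prompts))

overlap_shared = mean_pairwise_overlap(topk_per_prompt(sim_shared, k))
overlap_independent = mean_pairwise_overlap(topk_per_prompt(sim_independent, k))
print(overlap_shared, overlap_independent)
```

When the shared term dominates, independent selection hands nearly the same patches to every prompt, which is the redundancy that SOT-GLP's shared support and balanced assignment are meant to remove.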

2. Methodology: SOT-GLP

The authors propose SOT-GLP (Sparse Optimal Transport Guided Local-Global Prompt Learning), a dual-branch framework that preserves global alignment while explicitly modeling fine-grained spatial structure through a shared, sparse patch support.

A. Dual-Branch Architecture

The model employs two parallel streams within the vision encoder:

Global Branch: Uses standard CLIP Query-Key (Q-K) self-attention to extract the global [CLS] token. It learns shared global prompts ( $P_g$ ) to maintain robust category-level alignment and prevent overfitting.
Local Branch: Uses a parallel Value-Value (V-V) attention stream. Unlike Q-K attention, V-V attention directly correlates value representations, strengthening patch-to-patch interactions to capture textures and fine-grained parts. This stream extracts local patch tokens ( $Z_{local}$ ).
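
To make the contrast between the two streams concrete, here is a minimal single-head sketch. It assumes the common formulation in which V-V attention replaces the query-key similarity with a value-value similarity; the absence of multi-head splitting and output projections is a simplification, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qk_attention(q, k, v):
    """Standard attention: weights come from query-key similarity."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def vv_attention(v):
    """V-V attention (assumed form): weights come from value-value similarity."""
    weights = softmax(v @ v.T / np.sqrt(v.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 5, 8
q, k, v = (rng.normal(size=(n_tokens, dim)) for _ in range(3))

global_out, qk_weights = qk_attention(q, k, v)
local_out, vv_weights = vv_attention(v)
print(global_out.shape, local_out.shape)
```

Because the V-V weights depend only on the value vectors, tokens with similar content reinforce one another, which is the patch-to-patch interaction the local branch relies on.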

B. Prompt Parameterization

Global Prompts: A set of learnable token sequences shared across all classes, initialized from a standard template (e.g., "a photo of a").
Local Prompts: Class-specific learnable prompts ( $P_c^\ell$ ) designed to capture discriminative attributes unique to each category.

C. Sparse Optimal Transport (SOT) Alignment

To solve the problem of redundant patch selection, the local branch introduces a two-stage alignment process:

Saliency-Guided Sparsification: Instead of aligning all patches, the model computes a saliency map for each class by averaging the similarity between patches and the class-specific local prompts. It then selects a shared Top-K set of the most salient patches ( $S_c$ ) for that class. This filters out background noise and establishes a common support set.
Balanced Entropic Optimal Transport: The selected sparse patches are aligned to the multiple class-specific local prompts using Optimal Transport (OT).
- Balanced Marginals: The OT formulation enforces uniform marginal constraints, ensuring that the "mass" (assignment probability) is distributed evenly among prompts.
- Effect: This prevents "prompt collapse" (where all prompts focus on the same single patch) and forces a soft partition of patches. Because the entropic plan is soft, prompts can still share some mass on a patch, but different prompts end up specializing in different visual parts (e.g., one prompt focuses on the head, another on the tail), which promotes diversity and keeps overlap low.
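
The two stages above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: cosine similarity as the patch-prompt score, cost defined as `1 - similarity`, plain Sinkhorn iterations as the entropic OT solver, and the hyperparameters `k`, `eps`, and `n_iters` are all choices made here for the example.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_salient_patches(patches, prompts, k):
    """Stage 1: average patch-prompt similarity over prompts, keep one shared top-k set."""
    sim = l2_normalize(patches) @ l2_normalize(prompts).T  # (num_patches, num_prompts)
    saliency = sim.mean(axis=1)                            # one score per patch
    idx = np.argsort(-saliency)[:k]
    return idx, sim[idx]                                   # (k,), (k, num_prompts)

def sinkhorn(cost, eps=0.1, n_iters=300):
    """Stage 2: balanced entropic OT with uniform marginals on patches and prompts."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # uniform mass over selected patches
    b = np.full(m, 1.0 / m)        # uniform mass over prompts
    kernel = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)       # rescale rows toward marginal a
        v = b / (kernel.T @ u)     # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))  # toy stand-in for the local patch tokens
prompts = rng.normal(size=(4, 64))    # toy stand-in for one class's local prompts

idx, sim = select_salient_patches(patches, prompts, k=16)
plan = sinkhorn(1.0 - sim)
local_score = float((plan * sim).sum())  # OT-weighted local similarity for this class
print(plan.sum(axis=0))                  # each prompt receives the same total mass
```

The column sums of `plan` show the balancing effect: every prompt receives an equal share of the patch mass, so no single prompt can absorb all of the salient regions.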

D. Training and Inference

Loss Function: The total loss is a weighted sum of the global contrastive loss ( $L_{global}$ ) and the local cross-entropy loss ( $L_{local}$ ) derived from the OT scores.
Inference: Final scores combine the global similarity and the OT-weighted local similarity.
OOD Detection Strategy: The authors identify that learnable local projections can distort the pre-trained feature manifold. They propose a variant without the learnable local projection to preserve the native CLIP geometry, which significantly improves OOD detection robustness.
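
A minimal sketch of how the two branches might be combined follows. The summary does not give the weighting scheme, so the loss weight `lam`, the fusion weight `alpha`, and the use of softmax cross-entropy over class logits for both terms are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class."""
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(labels)), labels].mean())

def total_loss(global_logits, local_logits, labels, lam=1.0):
    """Weighted sum of the global and local terms (lam is a hypothetical weight)."""
    return cross_entropy(global_logits, labels) + lam * cross_entropy(local_logits, labels)

def fused_scores(global_logits, local_logits, alpha=0.5):
    """Inference-time fusion as a convex combination (alpha is a hypothetical weight)."""
    return alpha * global_logits + (1.0 - alpha) * local_logits

labels = np.array([0, 2])
global_logits = np.array([[4.0, 1.0, 0.5], [0.2, 0.1, 3.0]])  # image-to-class similarity
local_logits = np.array([[2.5, 0.3, 0.1], [0.0, 1.5, 2.0]])   # OT-weighted local scores

loss = total_loss(global_logits, local_logits, labels)
predictions = fused_scores(global_logits, local_logits).argmax(axis=-1)
print(loss, predictions)
```

In the projection-free variant described above, only the way the local features are produced changes; the fusion step itself would stay the same under these assumptions.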

3. Key Contributions

Shared Sparse Patch Support: Unlike prior works that select patches independently per prompt, SOT-GLP selects a single shared set of salient patches and allocates them via Optimal Transport, eliminating redundancy.
Balanced Optimal Transport: The use of balanced entropic OT ensures that class-specific prompts specialize in distinct visual regions, preventing prompt collapse and improving feature diversity.
V-V Attention for Locality: The framework repurposes Value-Value attention as a dedicated stream for extracting locality-aware features, enhancing the model's ability to capture textures and parts.
Accuracy-Robustness Trade-off Discovery: The paper demonstrates a distinct trade-off:
- With Learnable Projection: Maximizes few-shot classification accuracy.
- Without Learnable Projection: Preserves the pre-trained CLIP manifold, yielding state-of-the-art Out-of-Distribution (OOD) detection performance.

4. Experimental Results

The method was evaluated on 11 standard benchmarks (including ImageNet, Caltech101, OxfordPets, and Flowers102) and OOD detection suites.

Few-Shot Classification:
- On 16-shot ViT-B/16, SOT-GLP achieved an average accuracy of 85.1%, outperforming all prior prompt-learning baselines (e.g., GalLoP at 84.4%, CoOp at 79.9%).
- It set new state-of-the-art (SOTA) results on 9 out of 11 datasets, with significant gains on texture-heavy (DTD) and fine-grained (Flowers102, Cars) tasks.
Out-of-Distribution (OOD) Detection:
- The variant without the learnable local projection achieved 94.2% AUC and 23.8% FPR95 on ImageNet-based OOD detection.
- This outperformed fully adapted models (e.g., GalLoP at 93.2% AUC, a 1.0-point gap) while in-distribution accuracy stayed essentially unchanged (75.4% vs. 75.5%).
Ablation Studies:
- Removing V-V attention reduced accuracy by 0.3%.
- Removing the local projection reduced accuracy by 0.9% but boosted OOD detection to SOTA levels.
- Removing class-specific prompts reduced accuracy by 0.6%, confirming the necessity of per-class specialization.
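
For readers unfamiliar with the OOD metrics quoted in these results, the sketch below shows one standard way to compute AUC (AUROC) and FPR95 from per-sample confidence scores. It assumes that a higher score means "more in-distribution"; how the paper derives its scores is not stated in this summary, and the toy numbers are made up for the example.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample (ties count 0.5)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float((id_s > ood_s).mean() + 0.5 * (id_s == ood_s).mean())

def fpr_at_95_tpr(id_scores, ood_scores):
    """Share of OOD samples accepted as ID at the threshold that keeps 95% of ID samples."""
    threshold = np.percentile(id_scores, 5)
    return float((np.asarray(ood_scores, dtype=float) >= threshold).mean())

# Toy scores: perfectly separated in-distribution and OOD samples.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.1, 0.2, 0.3, 0.4]

print(auroc(id_scores, ood_scores))          # 1.0
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.0
```

Higher AUC and lower FPR95 are better, which is why the projection-free variant's combination of the two is reported as the stronger OOD result.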

5. Significance

SOT-GLP advances prompt learning by addressing the "local evidence allocation" problem. By using Optimal Transport to enforce a soft partition of visual regions among prompts, it encourages the model to learn complementary rather than redundant features.

Crucially, the paper highlights a practical insight for deployment: the choice of whether to include a learnable projection layer allows practitioners to tune the model for either maximum in-distribution accuracy or maximum robustness against distribution shifts (OOD). This flexibility makes SOT-GLP a versatile framework for real-world applications where data distribution shifts are common.