Imagine you are trying to teach a very smart, but slightly confused, robot how to guess the age of a person just by looking at a photo. This robot is a Multimodal Large Language Model (MLLM). It's like a brilliant student who has read every book in the library but has never actually seen a human face before.
To help the robot, you give it a few example photos with the correct answers written next to them. This is called In-Context Learning (ICL). The robot looks at your examples, figures out the pattern, and then guesses the age of the new photo.
The big problem? Which examples do you show the robot?
The Old Way: The "Look-Alike" Strategy (kNN)
Traditionally, computers use a simple rule: "Show the robot examples that look exactly like the new photo."
- The Analogy: Imagine you ask the robot to guess the age of a 10-year-old boy. The old method (called k-Nearest Neighbors or kNN) would show it 10 other photos of 10-year-old boys who look almost identical.
- The Flaw: This is like studying for a math test by only doing problems that look exactly like the one you're stuck on. You might get that specific problem right, but you don't learn the range of possibilities. If the robot only sees 10-year-olds, it might get confused if the new photo is a 10-year-old in a weird hat or a 10-year-old with a different skin tone. It lacks context.
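The "look-alike" strategy can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's code), assuming each photo has already been turned into an embedding vector by some vision encoder:

```python
# Toy sketch of kNN demonstration selection: pick the k candidate
# images whose embeddings are most similar to the query's embedding.
import numpy as np

def knn_select(query_emb, pool_embs, k=4):
    """Return indices of the k pool images most similar to the query."""
    # Cosine similarity between the query and every candidate.
    pool_norm = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    q_norm = query_emb / np.linalg.norm(query_emb)
    sims = pool_norm @ q_norm
    # Highest similarity first.
    return np.argsort(-sims)[:k]

# Toy 2-D "embeddings": candidates 0 and 2 point almost the same
# direction as the query, so they are the ones kNN picks.
pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
print(knn_select(query, pool, k=2))  # indices 0 and 2
```

Notice that nothing here looks at the labels or the range of answers: the selection is driven purely by visual similarity, which is exactly the flaw described above.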
The New Way: The "Smart Curator" (LSD)
The authors of this paper, Eugene Lee and his team, realized that for complex tasks (like guessing age or image quality), you don't just need similar examples; you need diverse examples that cover the whole spectrum.
They created a system called LSD (Learning to Select Demonstrations). Instead of a simple rule, they built a Reinforcement Learning Agent—think of this agent as a Smart Curator or a Coach.
How the Coach Works:
- The Goal: The Coach's job isn't just to find similar photos; it's to build a "study guide" that helps the robot get the best possible score on the final test.
- The Strategy: If the task is Objective (like guessing age or image quality), the Coach knows the robot needs to see the whole picture. So, for a 10-year-old query, the Coach might show:
  - One baby (the bottom of the scale).
  - One teenager (the middle).
  - One elderly person (the top).
  - A few other 10-year-olds with different features.
- This creates a "boundary" for the robot. It learns, "Okay, this kid is older than the baby but younger than the teenager."
- The Learning: The Coach learns by trial and error. It tries different sets of photos, sees how well the robot guesses, and gets a "reward" if the robot is right. Over time, it learns the perfect mix of relevance (photos that matter) and diversity (photos that show different extremes).
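The paper's actual selector is a learned reinforcement-learning policy; as a rough, hypothetical illustration of the idea (all names here are made up for this sketch, not the paper's API), here is a toy version that mixes "look-alike" picks with spectrum-spanning picks, and nudges the mix toward whatever earns a higher reward:

```python
# Toy stand-in for the "Smart Curator": blend relevance (examples with
# labels close to the query) and diversity (examples from the extremes
# of the label scale), then adjust the blend based on reward.

def select_demos(query_age, pool, n=4, diversity=0.5):
    """pool: list of (image_id, age). Pick a mix of near and far examples."""
    by_closeness = sorted(pool, key=lambda d: abs(d[1] - query_age))
    n_diverse = round(n * diversity)
    near = by_closeness[: n - n_diverse]           # relevance: look-alikes
    by_extreme = sorted(pool, key=lambda d: d[1])  # diversity: span the scale
    far = [by_extreme[0], by_extreme[-1]][:n_diverse]  # youngest + oldest
    return near + far

def update_diversity(diversity, reward, lr=0.1, baseline=0.5):
    """Trial-and-error update: raise the diversity mix if it paid off."""
    return min(1.0, max(0.0, diversity + lr * (reward - baseline)))

pool = [("img_a", 1), ("img_b", 15), ("img_c", 80), ("img_d", 10), ("img_e", 12)]
demos = select_demos(query_age=10, pool=pool, n=4, diversity=0.5)
# demos now contains two other ~10-year-olds plus the youngest and
# oldest examples in the pool, bracketing the answer range.
```

A real RL agent would learn a much richer policy than this single `diversity` knob, but the trade-off it balances, relevance versus coverage of the answer range, is the same one described above.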
The Big Discovery: One Size Does Not Fit All
The most fascinating part of the paper is a dichotomy: the best teaching strategy splits cleanly depending on the kind of task.
For "Factual" Tasks (Age, Image Quality):
- The Old Way (kNN) fails. It's too repetitive.
- The New Way (LSD) wins. The "Smart Curator" is essential here because it teaches the robot the range of the answer. It's like teaching someone to estimate weight by showing them a feather, a bowling ball, and a car, rather than just 10 bowling balls.
For "Subjective" Tasks (Aesthetics, Beauty):
- The Old Way (kNN) wins.
- The New Way (LSD) struggles.
- Why? Beauty is in the eye of the beholder. If you ask, "Is this sunset beautiful?", showing the robot 10 different types of sunsets might confuse it. It's better to show it 10 sunsets that look exactly like the one you are asking about, so it can say, "This one is just as beautiful as those." Here, similarity is king, and diversity is noise.
The Takeaway
This paper teaches us that how we teach AI depends entirely on what we are asking it to do.
- If you want the AI to learn facts and ranges (like math or science), you need a teacher who provides a diverse curriculum (LSD).
- If you want the AI to learn taste and style (like art or fashion), you need a teacher who provides perfect examples (kNN).
The authors didn't just build a better tool; they figured out when to use a hammer and when to use a screwdriver, solving a major puzzle in how we get AI to learn from examples.