Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

Imagine you are a teacher trying to help a class of students (an AI model) learn how to solve difficult math and logic puzzles using pictures. You have a special grading system called Group Relative Policy Optimization (GRPO).

Here's how the old system worked:
Every time you give the class a puzzle, you ask them to come up with 8 different answers. You then look at the group of 8 answers. If 7 are wrong and 1 is right, the "right" answer gets a massive boost in confidence. If 7 are right and 1 is wrong, the "wrong" answer gets a massive penalty.

The Problem:
This grading system gets confused by extreme cases, where nearly every answer in the group is the same.

The "Too Easy" Puzzle: Imagine a picture of a single red apple. The model gets 8 answers, and all 8 are "Red Apple." The system calculates the "average" and "spread" of these answers. Since they are all the same, the "spread" is zero. The math breaks down, and the model gets confused about how much to learn from this easy win.
The "Too Hard" Puzzle: Imagine a picture of a chaotic, abstract mess. The model gets 8 answers, and all 8 are nonsense. Again, the "spread" is tiny. The model gets confused about how much to punish itself for failing.

In the paper's terms, these are extreme samples where the "standard deviation" (a measure of how different the answers are) is too small, causing the math to go haywire. It's like trying to measure the temperature of a room with a thermometer that only works if the temperature is changing rapidly; if the room is perfectly still, the thermometer breaks.

The Solution: "Durian" (Difficulty-Aware Group Normalization)
The authors, Jinghan Li and colleagues, realized that instead of treating every group of answers the same, they should sort the puzzles by difficulty first. They call their new method Durian (named after the spiky fruit, perhaps implying it's tough but rewarding, or just a catchy name).

They split the puzzles into two types of difficulty:

Visual Difficulty (Perceptual):
- The Analogy: Imagine sorting photos by how "busy" they are. A photo of a blank white wall is low difficulty. A photo of a crowded city street with thousands of tiny details is high difficulty.
- The Fix: They group the "busy" photos together and the "simple" photos together. They only compare the answers within the "busy" group and within the "simple" group. This prevents a simple photo from messing up the math for a complex one.
Thinking Difficulty (Reasoning):
- The Analogy: Imagine sorting questions by how confident the students feel. Some questions make the students say, "I'm 100% sure!" (High confidence). Others make them say, "I'm guessing..." (Low confidence).
- The Fix: They group the "guessing" questions together and the "sure" questions together. They only compare the answers of the "guessing" group against each other, and the "sure" group against each other.

How it Works in Practice:
Instead of one giant classroom where everyone is compared to everyone else, Durian creates small study groups based on how hard the task is.

The "Easy Visual" group shares their own "standard score."
The "Hard Visual" group shares their own "standard score."
The "Confident Thinkers" group shares their own score.
The "Uncertain Thinkers" group shares their own score.

By doing this, the "extreme" cases (where everyone gets it right or everyone gets it wrong) don't break the math because they are compared only to others who are in the same boat.

The Result:
The paper shows that this method makes the AI much smarter. It learned to solve visual math problems (like geometry and charts) much better than before. On average, it improved its scores by over 11%, and on some tricky tests, it jumped by 16%.

In a Nutshell:
The old way was like putting a genius, a beginner, and a confused student in the same room and asking them to grade each other. It didn't work well because the gap was too big.
Durian is like putting the geniuses in one room, the beginners in another, and the confused students in a third. Now, everyone is learning from peers at their own level, leading to much faster and more stable improvement.

1. Problem Statement

The paper addresses a critical instability issue in Reinforcement Learning with Verifiable Rewards (RLVR), specifically when applied to Multi-Modal Large Language Models (MLLMs) using Group Relative Policy Optimization (GRPO).

The Core Issue: GRPO stabilizes training by normalizing rewards within a group of $G$ sampled responses using the group's standard deviation (std). However, this std-based normalization is highly sensitive to "extreme samples."
The Mechanism of Failure: In multimodal settings, response groups often collapse into scenarios where nearly all samples are either correct (rewards $\approx 1$) or incorrect (rewards $\approx 0$).
- When rewards are nearly uniform, the calculated std approaches zero.
- Dividing by a near-zero std causes the advantage values to explode, disproportionately amplifying the gradient signal from these extreme cases.
- Conversely, samples with balanced rewards (mixed correct/incorrect) are neglected.
Why MLLMs are Worse: Unlike pure text LLMs, MLLMs face dual challenges: perceptual complexity (visual noise, clutter) and reasoning uncertainty. This leads to a higher frequency of extreme reward distributions, making standard GRPO training unstable and prone to overfitting on easy or impossible samples.
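
The amplification described above can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming binary 0/1 rewards, a group size of 8, the population standard deviation, and a small `eps` added to the denominator; none of these details are confirmed by the summary, and `grpo_advantages` is a hypothetical helper name.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative normalization: (r - mean) / (std + eps) within one group.

    eps is an assumed safeguard; without it, a fully uniform group gives 0/0.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

balanced = grpo_advantages([1, 1, 1, 1, 0, 0, 0, 0])  # 4 of 8 correct
skewed = grpo_advantages([1, 1, 1, 1, 1, 1, 1, 0])    # 7 of 8 correct
uniform = grpo_advantages([1, 1, 1, 1, 1, 1, 1, 1])   # all correct

print(np.abs(balanced).max())  # about 1.0
print(np.abs(skewed).max())    # about 2.65: the lone wrong answer dominates
print(np.abs(uniform).max())   # 0.0: every reward equals the mean
```

Under these assumptions, the single outlier in the skewed group receives an advantage roughly 2.6 times larger in magnitude than any sample in the balanced group, and the gap widens as the group size grows.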

2. Methodology: Durian (Difficulty-Aware Group Normalization)

The authors propose Durian, a strategy that re-groups samples based on difficulty before calculating the standard deviation for normalization. Instead of normalizing a random batch, Durian constructs groups where samples share similar difficulty levels, ensuring the std is computed over a meaningful distribution.

Durian decomposes difficulty into two complementary dimensions:

A. Perceptual Difficulty (Data-Centric)

Definition: Measures the visual complexity of the input image.
Implementation:
1. Extract patch-level visual features using a visual encoder (e.g., the vision encoder of Qwen2.5-VL).
2. Compute the covariance matrix of these patch features.
3. Perform eigenvalue decomposition on the covariance matrix.
4. Calculate the Shannon entropy of the normalized eigenvalue distribution.
  - Low Entropy: Variance concentrated in few dimensions (simple, clean images).
  - High Entropy: Variance distributed across many dimensions (complex, cluttered images).
Grouping: Samples are partitioned into Low, Medium, and High entropy groups (using 25th and 75th percentiles). The std is shared within these perceptual groups.
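
The four steps and the percentile grouping above might be sketched as follows. This is an illustration rather than the authors' code: the function names, the natural-log entropy, and the handling of values that fall exactly on a percentile boundary are assumptions, and random arrays stand in for real encoder features.

```python
import numpy as np

def spectral_entropy(patch_features):
    """Shannon entropy of the normalized eigenvalue spectrum of the patch covariance."""
    x = np.asarray(patch_features, dtype=float)   # shape: (num_patches, feature_dim)
    cov = np.cov(x, rowvar=False)                 # shape: (feature_dim, feature_dim)
    # eigvalsh suits symmetric matrices; clip tiny negative values from round-off
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]                                  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def perceptual_groups(entropies, low_pct=25, high_pct=75):
    """Assign 0 (low), 1 (medium), or 2 (high) using percentile cut points."""
    e = np.asarray(entropies, dtype=float)
    lo, hi = np.percentile(e, [low_pct, high_pct])
    return np.digitize(e, [lo, hi])

rng = np.random.default_rng(0)
# Stand-in features: rank-1 (variance in one direction) vs. isotropic noise
simple = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 32))
cluttered = rng.normal(size=(256, 32))
print(spectral_entropy(simple) < spectral_entropy(cluttered))  # True
```

The rank-1 input concentrates all variance in a single eigenvalue, so its entropy is close to zero, while the isotropic input approaches the maximum of log(32).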

B. Reasoning Difficulty (Model-Centric)

Definition: Measures the model's internal confidence in generating the reasoning chain.
Implementation:
1. Calculate the sequence-level log probability for each generated response.
2. Compute the average log probability across the $G$ rollouts for a specific input.
  - Low Confidence (Low Log Prob): Indicates high uncertainty and difficult reasoning.
  - High Confidence (High Log Prob): Indicates a clear, reliable reasoning path.
Grouping: Samples are grouped by confidence quantiles. The std is shared within these reasoning groups.
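
A rough sketch of the confidence score and quantile grouping is shown below. The summary does not say whether the sequence-level log probability is a sum or a length-normalized mean of token log probabilities, nor how many quantile groups are used; the sketch assumes a plain sum and three groups, and both function names are hypothetical.

```python
import numpy as np

def prompt_confidence(token_logprobs_per_rollout):
    """Mean sequence-level log probability over the G rollouts of one prompt.

    Assumption: sequence-level log prob = sum of token log probs (no length normalization).
    """
    seq_logprobs = [float(np.sum(tokens)) for tokens in token_logprobs_per_rollout]
    return float(np.mean(seq_logprobs))

def confidence_groups(confidences, num_groups=3):
    """Split prompts into num_groups buckets at evenly spaced quantiles.

    Group 0 holds the least confident prompts (lowest average log probability).
    """
    c = np.asarray(confidences, dtype=float)
    edges = np.quantile(c, np.linspace(0, 1, num_groups + 1)[1:-1])
    return np.digitize(c, edges)

# Two rollouts for one prompt, each a list of token log probabilities
print(prompt_confidence([[-0.1, -0.2], [-0.3, -0.4]]))

# Six prompts with made-up confidence scores
print(confidence_groups([-9.0, -7.0, -5.0, -3.0, -2.0, -1.0]))
```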

C. Combined Advantage

The final advantage ( $A_{Combined}$ ) is an element-wise weighted sum of three components:

Original GRPO Advantage ( $A_{GRPO}$ ).
Perceptual Difficulty-based Advantage ( $A_{Perceptual}$ ).
Reasoning Difficulty-based Advantage ( $A_{Reasoning}$ ).

$A_{Combined} = \alpha_{Ori} A_{GRPO} + \alpha_{Percep} A_{Perceptual} + \alpha_{Reason} A_{Reasoning}$

This ensures the model benefits from the original group distinctions while being stabilized by difficulty-aware normalization.
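
The combination can be illustrated with a small sketch. The summary does not spell out how the shared std is computed, so the code assumes one plausible reading: each prompt keeps its own reward mean, while the std is pooled over all rewards of the prompts in the same difficulty group. The equal weights and all function names are placeholders, not values from the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Per-prompt normalization: each row uses its own mean and std."""
    r = np.asarray(rewards, dtype=float)          # shape: (num_prompts, G)
    centered = r - r.mean(axis=1, keepdims=True)
    return centered / (r.std(axis=1, keepdims=True) + eps)

def shared_std_advantages(rewards, group_ids, eps=1e-6):
    """Per-prompt mean, but one std pooled over all rewards in a difficulty group."""
    r = np.asarray(rewards, dtype=float)
    g = np.asarray(group_ids)
    centered = r - r.mean(axis=1, keepdims=True)
    adv = np.empty_like(r)
    for gid in np.unique(g):
        mask = g == gid                           # prompts belonging to this group
        adv[mask] = centered[mask] / (r[mask].std() + eps)
    return adv

def combined_advantage(a_grpo, a_percep, a_reason, weights=(1/3, 1/3, 1/3)):
    """Element-wise weighted sum; the equal weights are placeholders."""
    w_ori, w_percep, w_reason = weights
    return w_ori * a_grpo + w_percep * a_percep + w_reason * a_reason

rewards = np.array([[1, 1, 1, 1, 1, 1, 1, 0],    # extreme: 7 of 8 correct
                    [1, 1, 0, 0, 1, 0, 1, 0]],   # balanced: 4 of 8 correct
                   dtype=float)
a_grpo = grpo_advantages(rewards)
a_percep = shared_std_advantages(rewards, group_ids=[0, 0])
a_reason = shared_std_advantages(rewards, group_ids=[0, 0])
combined = combined_advantage(a_grpo, a_percep, a_reason)

print(np.abs(a_grpo).max())    # about 2.65
print(np.abs(a_percep).max())  # about 1.89
```

In this toy case, pooling the std with a balanced prompt from the same group reduces the largest advantage magnitude, and the combined signal falls between the two.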

3. Key Contributions

Identification of a Structural Flaw: The paper empirically demonstrates that std-based normalization in GRPO is structurally unstable for MLLMs due to the high prevalence of extreme reward distributions (near 0 or 1) caused by multimodal complexity.
Novel Difficulty Characterization: Introduces a dual-perspective difficulty metric:
- Visual Entropy: A spectral analysis method to quantify image complexity without human labels.
- Model Confidence: Uses the log probabilities of generated responses, aggregated to the sequence level and averaged across rollouts, to quantify reasoning uncertainty.
Durian Algorithm: Proposes a re-grouping mechanism that shares standard deviations within difficulty-homogeneous groups, effectively eliminating the "division by near-zero" problem while preserving intra-group distinctions.
Efficiency: Unlike methods that require increasing rollout sizes (which are computationally expensive), Durian achieves stability by smarter grouping within existing batch sizes.

4. Experimental Results

The authors evaluated Durian on Qwen2.5-VL-7B using the Geometry3K dataset (2.1k samples) and tested on five benchmarks: MathVerse, MathVision, MathVista, WeMath, and HallusionBench.

Performance Gains:
- Durian achieved an average improvement of 11.3% over the baseline Qwen2.5-VL-7B.
- Specific gains were significant on MathVision (+16%) and HallusionBench.
- It outperformed other RL-based SOTA models (e.g., R1-VL, Vision-R1) and even some closed-source models (GPT-4o, Claude-3.5) on specific reasoning tasks, despite using significantly less training data (2.1k vs. 200k+).
Ablation Studies:
- Both Perceptual and Reasoning regrouping strategies individually improved performance.
- The combination (Durian) yielded the best results, confirming the complementary nature of visual complexity and reasoning uncertainty.
Robustness: The method showed stability across different hyperparameters (number of groups, weighting coefficients) and rollout sizes.

5. Significance

Stabilizing Multimodal RL: Durian provides a robust solution to the training instability inherent in applying GRPO to MLLMs, a field where RLVR is becoming the standard for reasoning enhancement.
Data Efficiency: By stabilizing the learning signal, Durian allows models to learn effectively from smaller datasets (2.1k samples), reducing the computational cost and data curation burden compared to methods requiring massive synthetic datasets.
General Paradigm: The concept of "difficulty-aware normalization" offers a generalizable principle for stabilizing reinforcement learning in any domain where input complexity varies significantly, ensuring that optimization focuses on learnable samples rather than extreme outliers.

In summary, Durian transforms the GRPO training process from a "one-size-fits-all" normalization into a difficulty-adaptive process, significantly enhancing the reasoning capabilities of Multi-Modal LLMs by aligning the optimization signal with the intrinsic complexity of the data and the model's confidence.
