ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

This paper introduces ϕ-DPO, a novel Fairness Direct Preference Optimization framework for Large Multimodal Models that mitigates both catastrophic forgetting and data-imbalance-induced bias through a new loss function and pairwise preference alignment, achieving state-of-the-art performance on continual learning benchmarks.

Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu

Published 2026-02-27

Imagine you have a brilliant, all-knowing assistant (a Large Multimodal Model, or LMM) who can see pictures, read text, and solve complex problems. You want this assistant to keep learning new things every day—like how to diagnose a specific disease, how to read a new type of map, or how to understand a new language—without forgetting everything it already knew.

This is the challenge of Continual Learning. But there's a catch: the data the assistant learns from is often messy and unfair. Some topics have thousands of examples (like "Biology"), while others have very few (like "Grammar").

If you just keep feeding this assistant new data, two bad things happen:

  1. Catastrophic Forgetting: It learns the new stuff so well it forgets the old stuff. It's like a student who studies for a math test so hard they forget how to read.
  2. Bias: Because there's more data on "Biology," the assistant gets really good at Biology but terrible at "Grammar." It becomes unbalanced and unfair.

This paper introduces a new solution called ϕ-DPO (pronounced "Phi-DPO"). Think of it as a Fairness Coach for your AI assistant. Here is how it works, using simple analogies:

1. The Old Way: "The Heavy-Handed Teacher"

Previously, researchers tried to stop forgetting by using a method called Knowledge Distillation. Imagine a teacher telling a student: "Don't change your mind too much! Remember what you knew yesterday."

  • The Problem: If the student is surrounded by a loud crowd shouting about "Biology," the teacher's advice gets drowned out. The student still ends up ignoring "Grammar" because the crowd is too loud. The teacher can't fix the unfairness of the crowd.
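The "heavy-handed teacher" is usually implemented as a knowledge-distillation penalty: a KL divergence that punishes the new model whenever its output distribution drifts from the old model's. A minimal NumPy sketch (an illustrative stand-in, not the paper's exact formulation):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(old_logits, new_logits, eps=1e-12):
    """KL(p_old || p_new): penalizes the new model for drifting
    away from the old (frozen) model's output distribution."""
    p_old = softmax(old_logits)
    p_new = softmax(new_logits)
    return float(np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps))))

# Identical outputs incur no penalty; a drifted output incurs a positive one.
same = kd_loss(np.array([2.0, 0.5, -1.0]), np.array([2.0, 0.5, -1.0]))
drift = kd_loss(np.array([2.0, 0.5, -1.0]), np.array([-1.0, 0.5, 2.0]))
```

Note that this penalty is applied uniformly to every example, which is exactly why, as the paper argues, it cannot counteract a skewed data distribution.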

2. The New Way: "The Preference Coach" (DPO)

The authors first switched to a method called Direct Preference Optimization (DPO). Instead of just saying "don't forget," this coach says: "Look at these two answers. One is good (remembering the past), and one is bad (forgetting). Which one do you prefer?"

  • The Analogy: Imagine a coach showing an athlete two video replays: one where they played perfectly yesterday, and one where they messed up today. The coach asks, "Which one do you want to be?" The athlete naturally tries to match the "good" video.
  • The Benefit: This helps the AI remember the past much better than the old "don't change" method.
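The "preference coach" corresponds to the standard DPO objective: increase the margin by which the current policy prefers the "good" (remembered) answer over the "bad" (forgotten) one, relative to a frozen reference model. A minimal sketch for a single preference pair (illustrative only; the log-probabilities below are made-up numbers, not values from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l:         current policy's log-prob of the preferred
                             ("remembered") and dispreferred ("forgotten") answer
    ref_logp_w / ref_logp_l: the frozen reference model's log-probs
    beta:                    strength of the preference margin
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The loss shrinks when the policy prefers the "good" answer more strongly
# than the reference does, and grows when that preference flips.
good = dpo_loss(logp_w=-1.0, logp_l=-5.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
bad = dpo_loss(logp_w=-5.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```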

3. The Problem with the New Way: "The Loud Crowd"

Even with the Preference Coach, there was still a problem. If the "bad" examples (the ones the AI forgets) mostly come from the "Grammar" group, and the "good" examples come from the "Biology" group, the coach still gets biased. The AI thinks, "Oh, Biology is important because there are so many Biology examples here. Grammar doesn't matter."

4. The Solution: "The Fairness Filter" (ϕ-DPO)

This is where ϕ-DPO shines. The authors added a special "Fairness Filter" to the coach's whistle.

  • The Metaphor: Imagine the coach holding a volume dial for each group. When the AI learns from the "Biology" crowd (which is huge and loud), the coach turns that crowd's volume down slightly. When the AI learns from the "Grammar" group (which is small and quiet), the coach turns their volume up.
  • The Result: The AI is forced to pay equal attention to the quiet, difficult topics (the "minority" groups) as it does to the loud, easy ones. It learns to balance the scales.
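The paper's exact ϕ-DPO objective is not reproduced here, but the "fairness filter" idea can be sketched as inverse-frequency reweighting of the per-pair DPO losses: pairs from over-represented groups are turned down, pairs from rare groups turned up. A hypothetical sketch (group names and numbers are illustrative):

```python
import math
from collections import Counter

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (see earlier sketch)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def fairness_weights(groups):
    """Inverse-frequency weights, normalized to mean 1: rare groups
    (e.g. "grammar") get amplified, frequent ones (e.g. "biology") dimmed."""
    counts = Counter(groups)
    raw = [1.0 / counts[g] for g in groups]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# A batch dominated by "biology": the lone "grammar" pair gets up-weighted,
# so the minority topic still pulls its weight in the averaged loss.
batch_groups = ["biology", "biology", "biology", "grammar"]
weights = fairness_weights(batch_groups)
losses = [dpo_loss(-1.0, -5.0, -2.0, -2.0) for _ in batch_groups]
weighted_loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
```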

Why is this a big deal?

In the real world, data is rarely perfect. We often have tons of photos of cats but very few of rare animals. We have lots of medical data for common diseases but very little for rare ones.

  • Without ϕ-DPO: The AI becomes a specialist in common things and fails at rare things, while also forgetting its old skills.
  • With ϕ-DPO: The AI becomes a well-rounded expert. It remembers its old skills, learns new ones, and treats every topic fairly, regardless of how much data is available.

The Bottom Line

The authors built a system that teaches AI to learn continuously without losing its memory or becoming biased. They proved mathematically that this works and tested it on real-world benchmarks (like medical imaging, remote sensing, and visual reasoning). The result? An AI that is smarter, fairer, and doesn't forget what it learned yesterday.

In short: They gave the AI a coach that not only helps it remember the past but also ensures it listens to the quiet voices in the room, not just the loud ones.
