ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

This paper introduces ϕ-DPO, a novel Fairness Direct Preference Optimization framework for Large Multimodal Models that mitigates both catastrophic forgetting and data-imbalance-induced bias through a new loss function and pairwise preference alignment, achieving state-of-the-art performance on continual learning benchmarks.

Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu

Published 2026-02-27

Imagine you have a brilliant, all-knowing assistant (a Large Multimodal Model, or LMM) who can see pictures, read text, and solve complex problems. You want this assistant to keep learning new things every day—like how to diagnose a specific disease, how to read a new type of map, or how to understand a new language—without forgetting everything it already knew.

This is the challenge of Continual Learning. But there's a catch: the data the assistant learns from is often messy and unfair. Some topics have thousands of examples (like "Biology"), while others have very few (like "Grammar").

If you just keep feeding this assistant new data, two bad things happen:

  1. Catastrophic Forgetting: It learns the new stuff so well it forgets the old stuff. It's like a student who studies for a math test so hard they forget how to read.
  2. Bias: Because there's more data on "Biology," the assistant gets really good at Biology but terrible at "Grammar." It becomes unbalanced and unfair.

This paper introduces a new solution called ϕ-DPO (pronounced "Phi-DPO"). Think of it as a Fairness Coach for your AI assistant. Here is how it works, using simple analogies:

1. The Old Way: "The Heavy-Handed Teacher"

Previously, researchers tried to stop forgetting by using a method called Knowledge Distillation. Imagine a teacher telling a student: "Don't change your mind too much! Remember what you knew yesterday."

  • The Problem: If the student is surrounded by a loud crowd shouting about "Biology," the teacher's advice gets drowned out. The student still ends up ignoring "Grammar" because the crowd is too loud. The teacher can't fix the unfairness of the crowd.
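The "heavy-handed teacher" is usually implemented as a knowledge-distillation penalty: a KL divergence that punishes the new model whenever its output distribution drifts from the old model's. A minimal NumPy sketch (an illustrative stand-in, not the paper's exact formulation):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(old_logits, new_logits, eps=1e-12):
    """KL(p_old || p_new): penalizes the new model for drifting
    away from the old (frozen) model's output distribution."""
    p_old = softmax(old_logits)
    p_new = softmax(new_logits)
    return float(np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps))))

# Identical outputs incur no penalty; a drifted output incurs a positive one.
same = kd_loss(np.array([2.0, 0.5, -1.0]), np.array([2.0, 0.5, -1.0]))
drift = kd_loss(np.array([2.0, 0.5, -1.0]), np.array([-1.0, 0.5, 2.0]))
```

Note that this penalty is applied uniformly to every example, which is exactly why, as the paper argues, it cannot counteract a skewed data distribution.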

2. The New Way: "The Preference Coach" (DPO)

The authors first switched to a method called Direct Preference Optimization (DPO). Instead of just saying "don't forget," this coach says: "Look at these two answers. One is good (remembering the past), and one is bad (forgetting). Which one do you prefer?"

  • The Analogy: Imagine a coach showing an athlete two video replays: one where they played perfectly yesterday, and one where they messed up today. The coach asks, "Which one do you want to be?" The athlete naturally tries to match the "good" video.
  • The Benefit: This helps the AI remember the past much better than the old "don't change" method.
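The "preference coach" corresponds to the standard DPO objective: increase the margin by which the current policy prefers the "good" (remembered) answer over the "bad" (forgotten) one, relative to a frozen reference model. A minimal sketch for a single preference pair (illustrative only; the log-probabilities below are made-up numbers, not values from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l:         current policy's log-prob of the preferred
                             ("remembered") and dispreferred ("forgotten") answer
    ref_logp_w / ref_logp_l: the frozen reference model's log-probs
    beta:                    strength of the preference margin
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The loss shrinks when the policy prefers the "good" answer more strongly
# than the reference does, and grows when that preference flips.
good = dpo_loss(logp_w=-1.0, logp_l=-5.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
bad = dpo_loss(logp_w=-5.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```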

3. The Problem with the New Way: "The Loud Crowd"

Even with the Preference Coach, there was still a problem. If the "bad" examples (the ones the AI forgets) mostly come from the "Grammar" group, and the "good" examples come from the "Biology" group, the coach still gets biased. The AI thinks, "Oh, Biology is important because there are so many Biology examples here. Grammar doesn't matter."

4. The Solution: "The Fairness Filter" (ϕ-DPO)

This is where ϕ-DPO shines. The authors added a special "Fairness Filter" to the coach's whistle.

  • The Metaphor: Imagine the coach holding a volume dial for each group. When the AI learns from the "Biology" crowd (which is huge and loud), the coach turns that crowd's volume down slightly. When the AI learns from the "Grammar" group (which is small and quiet), the coach turns their volume up.
  • The Result: The AI is forced to pay equal attention to the quiet, difficult topics (the "minority" groups) as it does to the loud, easy ones. It learns to balance the scales.
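The paper's exact ϕ-DPO objective is not reproduced here, but the "fairness filter" idea can be sketched as inverse-frequency reweighting of the per-pair DPO losses: pairs from over-represented groups are turned down, pairs from rare groups turned up. A hypothetical sketch (group names and numbers are illustrative):

```python
import math
from collections import Counter

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (see earlier sketch)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def fairness_weights(groups):
    """Inverse-frequency weights, normalized to mean 1: rare groups
    (e.g. "grammar") get amplified, frequent ones (e.g. "biology") dimmed."""
    counts = Counter(groups)
    raw = [1.0 / counts[g] for g in groups]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# A batch dominated by "biology": the lone "grammar" pair gets up-weighted,
# so the minority topic still pulls its weight in the averaged loss.
batch_groups = ["biology", "biology", "biology", "grammar"]
weights = fairness_weights(batch_groups)
losses = [dpo_loss(-1.0, -5.0, -2.0, -2.0) for _ in batch_groups]
weighted_loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
```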

Why is this a big deal?

In the real world, data is rarely perfect. We often have tons of photos of cats but very few of rare animals. We have lots of medical data for common diseases but very little for rare ones.

  • Without ϕ-DPO: The AI becomes a specialist in common things and fails at rare things, while also forgetting its old skills.
  • With ϕ-DPO: The AI becomes a well-rounded expert. It remembers its old skills, learns new ones, and treats every topic fairly, regardless of how much data is available.

The Bottom Line

The authors built a system that teaches AI to learn continuously without losing its memory or becoming biased. They proved mathematically that this works and tested it on real-world benchmarks (like medical imaging, remote sensing, and visual reasoning). The result? An AI that is smarter, fairer, and doesn't forget what it learned yesterday.

In short: They gave the AI a coach that not only helps it remember the past but also ensures it listens to the quiet voices in the room, not just the loud ones.
