FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval

This paper introduces FBCIR, a method for diagnosing focus imbalances in composed image retrieval models, that is, their tendency to over-attend to one modality. It also proposes a data-augmentation workflow with curated hard negatives that enforces balanced cross-modal reasoning and improves robustness in challenging scenarios.

Chenchen Zhao, Jianhuan Zhuo, Muxi Chen, Zhaohua Zhang, Wenyu Jiang, Tianwen Jiang, Qiuyong Xiao, Jihong Zhang, Qiang Xu

Published 2026-03-13

Imagine you are trying to find a specific photo in a massive library using a very smart, but slightly lazy, librarian.

The Task: You give the librarian a starting photo (say, a picture of a castle) and a note saying, "I want this, but in winter." Your goal is to find a picture of a castle covered in snow.

The Problem:
In the past, the librarian was trained on easy examples.

  • Easy Example 1: You show a castle and say "winter." The library only has pictures of castles in summer and pictures of snow-covered mountains (no castles). The librarian looks at the photo, sees "castle," ignores the note, and picks the only castle they see. They got the right answer, but they didn't actually listen to your note!
  • Easy Example 2: You show a summer castle and say "winter." The library has pictures of winter castles and summer castles. The librarian reads the note "winter," ignores the photo, and picks the winter castle. Again, right answer, wrong reasoning.

The librarian learned shortcuts. They realized they could just look at the picture or just read the note to get the answer, without doing the hard work of combining both.

The Real Challenge (The "Hard Case"):
Now, imagine a tricky test. You show a summer castle and say "winter." The library has:

  1. A winter castle (The correct answer).
  2. A winter mountain (Matches the note, wrong picture).
  3. A summer castle (Matches the picture, wrong note).

If the librarian is still using their shortcuts, they will fail. They might pick the winter mountain because they only read the note, or the summer castle because they only looked at the photo. To get the right answer, they must balance their attention: they need to look at the castle and the word "winter" at the same time.

Enter FBCIR: The Librarian's "Focus Check"

The authors of this paper, Chenchen Zhao and their team, realized that most AI models (the librarians) are bad at this balancing act. They created a tool called FBCIR to diagnose the problem.

1. The Diagnosis (Focus Interpretation)
Think of FBCIR as a "spotlight" that shines on exactly what the AI is looking at when it makes a decision.

  • It takes the photo and breaks it into tiny puzzle pieces.
  • It takes the text and breaks it into individual words.
  • It then plays a game of "What if I hide this piece?" If hiding a specific word (like "winter") or a specific part of the image (the castle tower) changes the answer, that piece is crucial.

By doing this, they found that most AIs are unbalanced. They are like a person wearing a blindfold over one eye. Sometimes they only look at the image; sometimes they only read the text. They aren't using both eyes together.
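The "What if I hide this piece?" game above is, in essence, occlusion-based attribution. Here is a minimal, self-contained sketch of the idea: treat the model as a black-box scoring function, hide one image patch or text token at a time, and record how much the score drops. All names here (`score`, `occlusion_importance`, `focus_balance`) and the toy scoring rule are illustrative assumptions for this post, not the paper's actual implementation.

```python
def score(image_patches, text_tokens):
    """Toy stand-in for a retrieval model's similarity score (illustrative,
    not the paper's model): rewards matching the target 'winter castle'."""
    s = 0.0
    if "castle" in image_patches:
        s += 1.0
    if "winter" in text_tokens:
        s += 1.0
    return s

def occlusion_importance(image_patches, text_tokens):
    """Hide each piece in turn; importance = drop in score when it is hidden."""
    base = score(image_patches, text_tokens)
    img_imp = []
    for i in range(len(image_patches)):
        hidden = image_patches[:i] + image_patches[i + 1:]
        img_imp.append(base - score(hidden, text_tokens))
    txt_imp = []
    for j in range(len(text_tokens)):
        hidden = text_tokens[:j] + text_tokens[j + 1:]
        txt_imp.append(base - score(image_patches, hidden))
    return img_imp, txt_imp

def focus_balance(img_imp, txt_imp):
    """Fraction of total attribution landing on the image modality:
    0.5 is balanced attention; near 0 or 1 signals a one-modality shortcut."""
    total_img, total_txt = sum(img_imp), sum(txt_imp)
    total = total_img + total_txt
    return total_img / total if total else 0.5

patches = ["sky", "castle", "tree"]
tokens = ["make", "it", "winter"]
img_imp, txt_imp = occlusion_importance(patches, tokens)
print(focus_balance(img_imp, txt_imp))  # 0.5: this toy model uses both modalities
```

A shortcut-taking model would score 0 or 1 on this balance metric, which is exactly the "blindfold over one eye" behavior the authors report finding.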

2. The Cure (Data Augmentation)
Knowing the problem, the team created a new training method called FBCIR-Data. Instead of giving the AI easy practice tests, they built a "Boot Camp" of difficult scenarios.

  • The Trick: They created "fake" wrong answers (negatives) that are designed to trick the lazy shortcuts.
    • Scenario A: They show a picture of a castle and a note saying "winter," and they add a fake option that is a winter mountain. If the AI only reads the text, it picks the winter mountain. To win, the AI must also check that the picture still shows a castle.
    • Scenario B: They add a fake option that is a summer castle. If the AI only looks at the picture, it picks the summer castle, because it ignored the note asking for winter. To win, the AI must read the text too.

They used advanced AI tools to generate these tricky examples automatically. It's like a coach who keeps changing the rules of the game so the player can't rely on old tricks and must learn to play the whole game properly.
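The two scenarios above can be sketched as a tiny negative-mining routine: for each query, pick one gallery item that matches only the text edit and one that matches only the reference image, so neither single-modality shortcut can win. Everything here (`Item`, `build_triplet`, the `subject`/`attribute` fields) is a hypothetical simplification for illustration, not the paper's actual FBCIR-Data pipeline, which generates such examples automatically with generative models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    subject: str    # what the picture shows, e.g. "castle"
    attribute: str  # e.g. "summer" or "winter"

def build_triplet(reference: Item, edit: str, gallery: list[Item]):
    """Return the correct target plus two curated hard negatives:
    one matching only the text edit, one matching only the reference image."""
    target = Item(reference.subject, edit)
    # Text-shortcut trap: right attribute, wrong subject (the winter mountain).
    text_only = next(x for x in gallery
                     if x.attribute == edit and x.subject != reference.subject)
    # Image-shortcut trap: same look as the reference (the summer castle).
    image_only = next(x for x in gallery
                      if x.subject == reference.subject
                      and x.attribute == reference.attribute)
    return target, [text_only, image_only]

gallery = [Item("mountain", "winter"),
           Item("castle", "summer"),
           Item("castle", "winter")]
target, negatives = build_triplet(Item("castle", "summer"), "winter", gallery)
print(target)     # Item(subject='castle', attribute='winter')
print(negatives)  # the winter mountain and the summer castle
```

A model trained against such triplets can only score well by combining both inputs, which is the point of the "Boot Camp."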

The Results

When they trained the AI on this new, tougher "Boot Camp" data:

  • The Shortcuts Disappeared: The AI stopped ignoring one part of the input. It started looking at both the image and the text equally.
  • Better Performance: The AI got much better at solving the hard, tricky puzzles (the "Hard Cases").
  • Still Good at the Basics: Interestingly, getting better at the hard stuff didn't make them worse at the easy stuff. They became more robust and reliable overall.

The Big Picture

This paper is like a mechanic realizing that cars are failing on icy roads because they were only tested on dry pavement.

  • The Mechanic (FBCIR): Checks the car and realizes the tires are only gripping the left side.
  • The Fix (FBCIR-Data): Takes the car to a special training track with ice and mud to force the tires to learn how to grip properly on both sides.
  • The Outcome: The car is now safe and reliable on any road, not just the easy ones.

In short, the paper teaches AI models to stop taking shortcuts and start paying attention to everything in the picture and the text, making them much smarter and more reliable for real-world use.