Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

This paper argues that the primary bottleneck in scaling multimodal large language models is not the diversity of task formats but the knowledge density of the training data: enriching image captions with structured knowledge yields more consistent performance improvements than adding task-specific supervision such as Visual Question Answering (VQA).

Hongjian Zou, Yue Ge, Qi Ding, Yixuan Liao, Xiaoxin Chen

Published 2026-04-16

The Big Idea: It's Not About the Game, It's About the Library

Imagine you are trying to teach a brilliant student (the AI) how to understand the world.

For a long time, researchers thought the best way to do this was to give the student more and more different types of tests. They thought, "If we make the student practice answering riddles, solving puzzles, and playing 'guess the object' games (these are called VQA or Visual Question Answering), they will get smarter."

This paper argues that this is wrong.

The authors say: "Stop worrying about the type of test. The problem isn't that the student hasn't played enough games. The problem is that the library the student is studying from is too thin."

They call this missing ingredient Knowledge Density.


Analogy 1: The "Rephrasing" Trick (Why VQA Doesn't Help Much)

Imagine you have a photo of a dog running on grass.

  • The Old Way (VQA): You show the photo and ask, "What animal is running?" The student answers, "A dog."
  • The New Way (Caption): You just write a sentence: "A dog is running on the grass."

The paper's experiments showed something surprising: The student learns the exact same thing from the sentence as they do from the question-and-answer game.

Think of it like this:

  • The Caption is the book containing the facts.
  • The VQA is just highlighting a specific sentence in that book and asking, "What did I just highlight?"

If you only have a thin book, highlighting different sentences (changing the question) doesn't give you more knowledge. You still only have the same thin book. The "game" format (VQA) just rearranges the information you already had; it doesn't add anything new.

The Takeaway: Adding more question-and-answer formats is like rearranging the furniture in a small room. It looks different, but the room is still small.
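The "rephrasing trick" can be made concrete with a toy Python sketch. Everything here (the template, the naive word-index parsing) is an illustrative assumption, not the paper's actual data pipeline; the point is just that a VQA pair derived from a caption contains no fact that the caption didn't already state:

```python
# Toy illustration: a VQA pair derived from a caption carries no new facts.
# The template and parsing below are made up for illustration; they are not
# the paper's actual pipeline.

def caption_to_vqa(caption: str) -> tuple[str, str]:
    """Turn 'A dog is running on the grass.' into a (question, answer)
    pair by 'highlighting' part of the sentence."""
    words = caption.rstrip(".").split()
    subject = words[1]   # naive: second word ("dog")
    action = words[3]    # naive: fourth word ("running")
    question = f"What animal is {action}?"
    return question, subject

caption = "A dog is running on the grass."
question, answer = caption_to_vqa(caption)

# The answer was already in the caption: the QA "game" rearranges
# information, it does not add any.
assert answer in caption
print(question, "->", answer)  # What animal is running? -> dog
```

However many different questions you generate this way, the knowledge ceiling is set by the caption you started from.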


Analogy 2: The "Super-Book" (Why Knowledge Density Wins)

So, if changing the game format doesn't work, what does? Adding more facts to the book.

The authors tried a new method: Knowledge Injection.

Instead of just showing one picture of a dog, they showed two pictures side-by-side and wrote a description that compared them.

  • Old Caption: "Here is a dog."
  • New "Dense" Caption: "Here is a Shiba Inu dog running on green grass, while next to it is a Golden Retriever sleeping on a red rug. Notice how the Shiba has pointy ears and the Golden has floppy ears."

The Magic:
By forcing the AI to look at two things and describe the differences, relationships, and details, they packed way more "knowledge" into a single training example.

  • The Old Way: Feeding the AI 100 pictures with simple labels.
  • The New Way: Feeding the AI 100 pictures with rich, comparative stories that explain how things relate to each other.

The Result:
When they trained the AI on these "Super-Books" (dense knowledge), the AI got significantly smarter at reasoning, math, and understanding complex scenes. It didn't matter that they didn't change the "game" (the task format); they just made the "library" much richer.
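The knowledge-injection idea can also be sketched in Python. The field names and the comparison template below are hypothetical stand-ins, not the paper's method; the sketch just shows how merging two single-image fact sets into one comparative caption packs more facts into a single training example:

```python
# Hypothetical sketch of "knowledge injection": merge two simple per-image
# fact sets into one dense, comparative caption. Field names and the
# template are illustrative assumptions, not the paper's pipeline.

def dense_caption(a: dict, b: dict) -> str:
    """Combine two per-image fact dicts into one comparative caption."""
    base = (f"Here is a {a['breed']} {a['action']} on {a['surface']}, "
            f"while next to it a {b['breed']} is {b['action']} on {b['surface']}.")
    contrast = (f"Notice how the {a['breed']} has {a['ears']} ears "
                f"and the {b['breed']} has {b['ears']} ears.")
    return base + " " + contrast

shiba = {"breed": "Shiba Inu", "action": "running",
         "surface": "green grass", "ears": "pointy"}
golden = {"breed": "Golden Retriever", "action": "sleeping",
          "surface": "a red rug", "ears": "floppy"}

caption = dense_caption(shiba, golden)
print(caption)

# A crude density proxy: the merged caption carries every fact from both
# images, versus one fact each for captions like "Here is a dog."
facts = sum(len(d) for d in (shiba, golden))
print(f"{facts} facts packed into one training example")
```

The task format is unchanged throughout (it is still just a caption); only the amount of knowledge per example grows.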


The "Diminishing Returns" Problem

You might wonder, "Why didn't we figure this out sooner?"

Think of training an AI like building a skyscraper.

  • Text-only AI (like ChatGPT): We gave them a library with trillions of books. They built a skyscraper that goes straight up.
  • Multimodal AI (Image + Text): We gave them a library with only billions of books. We tried to make the building taller by adding more types of windows (different tasks), but the foundation (the knowledge) was too weak. The building kept hitting a ceiling.

The paper says: Don't add more window shapes. Add more books to the foundation.

Summary in Plain English

  1. The Myth: "If we give the AI more complex questions (VQA) to answer, it will get smarter."
  2. The Reality: "Those questions usually just repeat what's already in the image description. They don't add new facts."
  3. The Solution: "Stop focusing on the format of the question. Focus on the density of the information. Make the descriptions richer, more comparative, and more detailed."
  4. The Future: To build truly smart AI that can see and understand the world, we need to stop just "labeling" images and start "teaching" them through rich, knowledge-packed stories.

In short: It's not about how many questions you ask the AI. It's about how much truth you feed it.
