The Big Idea: The "Lazy Describer" Problem
Imagine you are trying to teach a robot how to understand the world by showing it millions of photos and the captions people wrote for them. You might think, "If I show the robot enough photos, it will eventually figure out everything."
This paper argues that quantity isn't enough. Even if you give the robot a billion photos, it will still be bad at certain tasks (like counting, understanding space, or knowing what didn't happen).
Why? Because of Reporting Bias.
Think of it like this: When humans describe a photo, we are naturally "lazy" or "efficient." We say only what is necessary to get the point across. We don't usually say, "There are exactly 17 people standing on the field," unless it's a crime scene. We just say, "A game today!"
The robot learns from these lazy descriptions. So, it never learns that "17 people" is a specific fact, or that "behind" is a specific spatial relationship. It just learns the vibe.
The Four Skills the Robot Missed
The researchers found that human captions naturally skip over four specific types of thinking:
- Counting: We rarely say "three dogs." We say "a pack of dogs." The robot never learns to count because it's rarely asked to.
- Spatial Reasoning: We rarely say "the cup is to the left of the plate." We just say "a cup and a plate." The robot doesn't learn the difference between left and right.
- Negation: We rarely say "There are no parrots in this picture." We just describe what is there. The robot struggles to understand what is missing.
- Time: We rarely say "The ball will fall after the throw." We just describe the current moment. The robot gets confused about cause and effect.
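To make the four categories concrete, here is a minimal sketch of how you might detect whether a caption even *touches* one of them. The keyword lists are my own illustrative guesses, not the paper's actual annotation scheme:

```python
import re

# Hypothetical keyword heuristics for each reasoning category --
# a rough sketch, not the paper's actual measurement method.
PATTERNS = {
    "counting": re.compile(r"\b(one|two|three|four|five|six|seven|eight|nine|ten|\d+)\b", re.I),
    "spatial":  re.compile(r"\b(left|right|above|below|behind|in front of|next to)\b", re.I),
    "negation": re.compile(r"\b(no|not|without|none|nothing)\b", re.I),
    "temporal": re.compile(r"\b(before|after|while|then|will|about to)\b", re.I),
}

def caption_categories(caption: str) -> set[str]:
    """Return which reasoning categories a caption's wording touches."""
    return {name for name, pat in PATTERNS.items() if pat.search(caption)}

print(caption_categories("two dogs to the left of a bench"))  # {'counting', 'spatial'}
print(caption_categories("a pack of dogs"))                   # set()
```

Run this over a large caption pool and the "lazy describer" effect shows up immediately: most captions trigger none of the four categories.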
The "More Data" Myth (Scaling)
For a long time, AI researchers believed in the "Scaling Law": If you just make the model bigger and feed it more data, it will magically become smart.
The paper says: No.
- The Analogy: Imagine you are trying to teach a student how to solve a math problem, but you only give them textbooks that say "The answer is 42" without showing the steps.
- The Result: If you give that student 1,000 textbooks, they still won't know how to get to 42. They just memorize the answer.
- The Finding: The researchers tried feeding the AI more data, bigger models, and even data in different languages. The AI got slightly better at recognizing things (like "is this a cat?"), but it did not get better at reasoning (like "how many cats?"). The "lazy" way humans write captions didn't change just because the dataset got bigger.
The Solution: Changing the Instructions
If the problem is that humans are too lazy to write detailed captions, the solution is to tell them to stop being lazy.
The researchers ran an experiment where they gave human annotators specific instructions, like:
- "Please count exactly how many objects are in this picture."
- "Describe exactly where the objects are relative to each other."
- "Mention what is not in the picture."
The Result:
When people were explicitly told to include these details, the share of "reasoning" captions skyrocketed.
- Without instructions: 2% of captions had counting details.
- With instructions: 39% of captions had counting details.
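The 2% vs. 39% comparison is just a rate: the fraction of captions that mention an explicit count. Here is a tiny sketch of that measurement; the sample captions below are invented for illustration, and the number-word pattern is my own simplification:

```python
import re

# Crude "mentions an explicit count" check -- illustrative only.
NUMBER = re.compile(r"\b(one|two|three|four|five|six|seven|eight|nine|ten|\d+)\b", re.I)

def counting_rate(captions: list[str]) -> float:
    """Fraction of captions that state an explicit count."""
    hits = sum(1 for c in captions if NUMBER.search(c))
    return hits / len(captions)

# Invented examples of "lazy" vs. "instructed" captions:
baseline   = ["a pack of dogs", "a game today", "a cup and a plate", "people at a park"]
instructed = ["three dogs on the grass", "two teams of eleven players",
              "one cup left of a plate", "four people, no children"]

print(counting_rate(baseline))    # 0.0
print(counting_rate(instructed))  # 1.0
```

The paper's real numbers (2% without instructions, 39% with them) come from this kind of counting over actual annotation pools, not from my toy lists.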
When they took this new, "intentional" data and taught the AI, the AI got much better at reasoning.
The "Synthetic" Trap
The paper also found a funny twist: even when an AI (like GPT-4) writes captions to train other AIs, it copies the same bad habits! Since the captioning AI was itself trained on human text, it also learned to be "lazy" and skip the details. So just using AI to generate more data doesn't fix the problem unless you give the AI very strict instructions.
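What do "very strict instructions" look like in practice? Here is a hypothetical prompt template for a caption-writing model. The wording and the request structure are my own sketch, not the paper's actual prompts, and the payload shape is a generic chat-style format you would adapt to whatever API you use:

```python
# Hypothetical "strict instruction" system prompt -- illustrative wording,
# not taken from the paper.
STRICT_CAPTION_PROMPT = """Describe the image. You MUST:
1. Count every salient object ("three dogs", not "some dogs").
2. State spatial relations explicitly ("the cup is left of the plate").
3. Mention one plausible object that is NOT present.
4. Note what just happened or is about to happen."""

def build_request(image_url: str) -> dict:
    """Assemble a generic chat-style request payload (shape is illustrative)."""
    return {
        "messages": [
            {"role": "system", "content": STRICT_CAPTION_PROMPT},
            {"role": "user", "content": f"Image: {image_url}"},
        ]
    }
```

The point is the contrast: a default "describe this image" prompt reproduces human reporting bias, while an itemized checklist like the one above forces the four skipped categories back into the data.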
The Takeaway
You can't just "scale" your way to intelligence.
If you want an AI to be good at reasoning, you can't just throw more data at it. You have to be intentional. You have to curate the data carefully and tell the people (or other AIs) writing the descriptions exactly what details to include.
In short: Don't just feed the robot a billion photos of "a game." Tell the robot, "Look at this photo, count the players, tell me who is on the left, and tell me who is missing." That's how you teach it to think.