Imagine you have a brilliant assistant who is incredibly smart at two very different things:
- Reading Text: They can solve complex math problems and answer trivia questions just by reading.
- Looking at Pictures: They can identify objects, read signs in photos, and count items in a scene.
You would think that if you asked this assistant to look at a picture of a math problem and then solve it, they would be perfect. After all, they are great at reading, and they are great at math.
The paper's big discovery is that they aren't.
Even though the assistant is a master of both skills individually, when you ask them to combine them to solve a single problem, they stumble. It's like a world-class chef who can bake a perfect cake and can also grill a perfect steak, but if you ask them to make a "Steak-and-Cake Sandwich," they get confused and serve you a burnt mess.
The Three "Test Drives"
The researchers put several popular AI models (like LLaVA, Qwen, and Molmo) through three simple tests to see how well they could "stitch" these skills together:
The "Photo of a Math Test" Challenge:
- The Task: Show the AI a photo of a math question written on paper (e.g., "What is 7 + 5?").
- The Skills Needed: First, the AI must read the text in the photo (like a scanner). Second, it must do the math.
- The Result: The AI often misreads the numbers in the photo or gets the math wrong, even though it can do both perfectly if given the text directly.
The "Count the Oranges" Challenge:
- The Task: Show the AI a picture of a fruit bowl and ask, "How many oranges are there?"
- The Skills Needed: First, spot the oranges in the picture. Second, count them using numbers.
- The Result: The AI might see the oranges but forget to count them, or count them but fail to identify them as oranges.
The "Card Game" Challenge:
- The Task: Show the AI a hand of playing cards and ask it to sort them or calculate a score based on complex rules (e.g., "Sum the value of all red cards").
- The Skills Needed: First, identify the cards (Is that a 7 of Hearts?). Second, apply the rules (Hearts are red, so that 7 counts toward the sum).
- The Result: The AI often gets the card identity right but messes up the math, or vice versa.
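The reasoning half of the card challenge is ordinary code once the cards are correctly identified. Here is a minimal Python sketch of the example rule above ("sum the value of all red cards"); the tuple representation and the sample hand are illustrative choices, not from the paper:

```python
# Suits counted as "red" in a standard deck.
RED_SUITS = {"Hearts", "Diamonds"}

def score_red(cards: list) -> int:
    """Sum the values of all red cards in a hand.

    Each card is a (value, suit) tuple, e.g. (7, "Hearts").
    """
    return sum(value for value, suit in cards if suit in RED_SUITS)

# Illustrative hand: a red 7, a black 3, and a red 10.
hand = [(7, "Hearts"), (3, "Spades"), (10, "Diamonds")]
print(score_red(hand))  # -> 17
```

The AI's failure isn't in this logic, which is trivial; it's in reliably producing the `(value, suit)` pairs from pixels and then applying the rule in the same pass.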
The "Cascaded" Hack
To prove the AI was capable of these skills, the researchers tried a "cheat code" called Cascaded Inference.
Instead of asking the AI to do everything in one go, they forced it to do it in two separate steps:
- Step 1: "Hey AI, just look at this picture and tell me what text is written there." (The AI does this perfectly).
- Step 2: "Okay, now take that text you just wrote down and solve the math problem." (The AI also does this perfectly).
The Gap: When they compared the "One-Step" attempt (Direct) to the "Two-Step" cheat code (Cascaded), the Two-Step method was almost always much better. This "gap" proves that the AI isn't failing because it can't read or can't count; it's failing because it can't coordinate the two skills smoothly in one brain.
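The two-step pipeline can be sketched in a few lines. The `vlm_transcribe` stub below stands in for a real model call (LLaVA, Qwen, and Molmo each have their own APIs, which this sketch does not reproduce); the point is only the structure: perception first, reasoning second, with clean text handed between the steps.

```python
def vlm_transcribe(image: dict) -> str:
    """Step 1 (perception only): read the text out of the image.

    Stub standing in for a real vision-language model call; the
    'image' here is just a dict carrying its embedded text.
    """
    return image["text"]

def solve_arithmetic(expr: str) -> int:
    """Step 2 (reasoning only): solve a simple 'a + b' expression."""
    a, b = (int(part) for part in expr.split("+"))
    return a + b

def cascaded_inference(image: dict) -> int:
    """Chain the two skills explicitly instead of asking for both at once."""
    transcript = vlm_transcribe(image)   # e.g. "7 + 5"
    return solve_arithmetic(transcript)

photo_of_math_test = {"text": "7 + 5"}
print(cascaded_inference(photo_of_math_test))  # -> 12
```

The "direct" condition in the paper collapses these two calls into a single prompt over the raw image; the measured gap is the difference between that one-shot attempt and this explicit pipeline.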
Why Does This Happen?
Think of the AI's brain like a busy kitchen.
- The Visual Team (eyes) sees the image.
- The Text Team (brain) knows the rules.
- In a "Direct" attempt, the Visual Team shouts the raw data to the Text Team, but the Text Team gets confused by the noise or misinterprets the visual signal.
- In the "Cascaded" approach, the Visual Team writes a clean, organized note, and then hands it to the Text Team. The Text Team is much happier and performs better.
Can We Fix It?
The researchers tried two ways to fix this "coordination problem":
The "Step-by-Step" Prompt (Chain-of-Thought):
They told the AI, "First, look at the image and write down what you see. Then, use that to solve the problem."
- Result: It helped! The AI got better at organizing its thoughts. But it wasn't a magic fix; it still wasn't as good as the two-step cheat code, and it required a very specific instruction for every new type of problem.
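The step-by-step idea amounts to wrapping the question in a prompt template. The wording below is an illustrative guess at that template, not the paper's exact prompt:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question in a 'describe first, then solve' instruction.

    Illustrative chain-of-thought template; the paper's actual
    prompt wording may differ.
    """
    return (
        "First, describe exactly what you see in the image, "
        "including any text, numbers, or objects. "
        "Then, using only that description, " + question
    )

print(cot_prompt("solve the math problem shown."))
```

The drawback the researchers found shows up here too: the template has to be tailored to each task family (reading text vs. counting objects vs. identifying cards), so it is a per-task patch rather than a general fix.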
Special Training (Fine-Tuning):
They gave the AI extra homework specifically designed to practice combining these skills.
- Result: This helped the AI get better at the specific homework it practiced. However, it didn't fully solve the problem for new types of tasks. The AI was still struggling to be a generalist at combining skills.
The Bottom Line
Multimodal AI models are like super-talented specialists who are surprisingly bad at teamwork. They can read and they can see, but they haven't quite learned how to blend those abilities seamlessly.
The paper concludes that we need to figure out why this happens and invent new ways to train these models so they don't just have "eyes" and a "brain," but a single, unified mind that can use both at the same time without tripping over its own feet. Until then, even the smartest AI models will struggle with tasks that seem trivial to a human.