Imagine you are trying to build a complex piece of IKEA furniture. You have the instruction manual (a bunch of paper diagrams) and you are holding the parts in your hands. Now, imagine you have a super-smart robot assistant watching you.
The Big Question: Can this robot look at your hands, look at the paper manual, and actually understand what you are doing? Can it tell you, "Hey, you just finished step 3, so now you need to grab the screw from step 4"?
This paper is about building a test track to see how good current AI robots are at being that helpful assistant.
1. The Problem: Old Tests vs. Real Life
Previously, researchers tested AI with simple, one-shot tasks, like "Is there a cat in this picture?" or "What is the person doing right now?" (for example, chopping an onion).
But building furniture isn't just one action; it's a story. It's a sequence of events. You have to remember what you did five minutes ago to know what to do next. Old tests didn't check if the AI could follow the whole story. They were like testing a driver by asking them to stop at a red light, but never checking whether they could actually drive from New York to Los Angeles.
2. The Solution: The "M2AD" Dataset
The authors created a new training ground called M2AD (Manual-to-Action Dataset).
- The Ingredients: They took 53 different IKEA furniture videos from YouTube (real people building real things) and aligned each video, step by step, with the official PDF instruction manual.
- The Labeling: They didn't just label "Step 1." They marked exactly when Step 1 started and ended in the video, and which page of the manual it was on.
- The Twist: They made it "real." Real people make mistakes. They skip steps, go back to fix things, or take a coffee break. The dataset captures all that messy, human reality, not just a perfect, robotic version of assembly.
Think of M2AD as a gym for AI. Instead of lifting a single weight, the AI has to run a marathon while reading a map.
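To make those labels concrete, here is a tiny sketch of what one annotated step might look like. The field names and values are my own invention for illustration, not the dataset's actual schema:

```python
# A hypothetical sketch of one M2AD-style annotation record.
# Field names and values are illustrative, not the dataset's real schema.
annotation = {
    "video_id": "ikea_desk_build_07",  # which YouTube assembly video
    "step_number": 3,                  # step in the official manual
    "manual_page": 4,                  # PDF page showing this step
    "start_seconds": 212.5,            # when the step begins in the video
    "end_seconds": 298.0,              # when it ends
    "deviation": "redo",               # e.g. skip / redo / pause: the messy, human reality
}

def step_duration(record):
    """Length of a labeled step in seconds."""
    return record["end_seconds"] - record["start_seconds"]

print(step_duration(annotation))  # 85.5
```

The key point is the last field: because real builders skip, redo, and pause, the labels have to record deviations from the manual, not just a clean "Step 1, Step 2, Step 3" sequence.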
3. The Test: Putting AI to Work
The researchers took several "open-source" AI models (the kind you can run on a powerful home computer, not just a giant supercomputer) and put them through three specific challenges:
Challenge A: The "Did I Finish?" Check.
- Scenario: The AI sees a video clip of you screwing a leg onto a table and a picture of the manual page.
- Task: Can the AI say, "Yes, you are done with this step"?
- Result: The best models managed about 56% accuracy. Since this is a yes/no question, that is only a little better than a coin flip, and many models were essentially guessing.
Challenge B: The "Where Am I?" Check.
- Scenario: The AI sees you working and is shown two pages from the manual (one is the right page, one is a trick page).
- Task: Can the AI pick the correct page that matches what you are doing?
- Result: Again, only the smartest models could do this reliably. Most got confused.
Challenge C: The "What Number Step?" Check.
- Scenario: The AI sees you working and is shown two pages.
- Task: Can the AI say, "You are currently on Step 12"?
- Result: This was the hardest. Most AIs failed miserably. One model, MolMo, did surprisingly well, but it cheated a little by just looking at the layout of the image (like saying "The text on the left says 12") rather than truly understanding the logic.
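One way to see why these results are underwhelming is to compare them against random guessing. The sketch below does that arithmetic; the chance levels follow from how each task is set up, and everything except the 56% figure (quoted above for Challenge A) is my own illustrative framing, not the paper's code:

```python
# Hypothetical illustration of why ~56% accuracy is "barely better than guessing".
# The chance baselines follow from the task setup; the 20-step manual for
# Challenge C is an assumed example, not a number from the paper.

def chance_level(num_options):
    """Accuracy of uniform random guessing over num_options answers."""
    return 1.0 / num_options

baselines = {
    "A (done?)":       chance_level(2),   # yes/no completion check
    "B (which page?)": chance_level(2),   # correct page vs. trick page
    "C (which step?)": chance_level(20),  # name the step, assuming a 20-step manual
}

reported_accuracy_a = 0.56  # quoted for Challenge A in the summary above
margin = reported_accuracy_a - baselines["A (done?)"]
print(f"Challenge A beats coin-flipping by only {margin:.0%}")
```

Notice that Challenge C is the only one where chance is far below 50%, which is also why a model that exploits image-layout shortcuts (like MolMo reading "12" off the page) can look deceptively strong there.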
4. The Verdict: Promising, But Not Ready for Prime Time
The paper concludes that while AI is getting smarter, it still struggles with long, complex stories.
- The Good News: AI is getting better at understanding that a video and a manual are connected. We don't need to teach the AI every single tiny detail anymore; it can learn some things on its own.
- The Bad News: Current AI models have "short attention spans" and "weak eyes." They can't look at a whole video sequence and a whole manual at the same time without getting overwhelmed. They often get lost in the middle of the story.
- The Hardware Hurdle: To make this work in a real factory or home, the AI needs to run on a regular computer, not a massive server farm. The current models are too heavy and slow for that.
The Bottom Line
This paper is like a report card for the next generation of robot helpers. It says: "You are getting better at reading the map and watching the road, but you still get lost if the trip is too long."
The authors hope that by using this new dataset (M2AD), developers will build better, more patient, and more observant AI assistants that can finally help us build our furniture without getting frustrated!