MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

The paper introduces MORE-R1, a Large Vision-Language Model trained in two stages: Supervised Fine-Tuning on automatically constructed stepwise reasoning data, followed by Reinforcement Learning with Group Relative Policy Optimization. The result is state-of-the-art performance on Multimodal Object-Entity Relation Extraction.

Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo

Published Wed, 11 Ma

Imagine you are a detective trying to solve a mystery, but you have two very different clues: a photograph and a newspaper headline. Your job is to figure out the relationship between a specific person in the photo and a specific word in the headline.

For example:

  • Photo: A basketball player wearing a green jersey with the number 0.
  • Headline: "After 4 Lopsided Games, Heat and Celtics Are Tied Up."
  • The Mystery: What is the relationship between the player in the photo and the word "Heat"?

Most computer programs today are like detectives who just guess the answer based on a quick glance. They might say, "Oh, they look like partners!" because they see two sports teams. But they miss the nuance: the player is on the Celtics team, and the headline says the Celtics are playing against the Heat. The real answer is "Opponents."

This paper introduces MORE-R1, a new "Super Detective" that doesn't just guess. It thinks before it speaks.

Here is how it works, broken down into simple steps:

1. The Problem: The "Guessing Game" is Broken

Old methods are like students who memorized a list of 21 possible answers (like "Partner," "Opponent," "Leader," etc.). If the answer isn't on their list, they fail. If the situation is tricky, they get confused. They also can't explain why they chose an answer, making them unreliable.

2. The Solution: A Two-Stage Training Camp

The authors built MORE-R1 using a powerful AI brain (called an LVLM) and trained it in two distinct phases, like a sports team preparing for the big game.

Stage 1: The "Textbook" Learning (Cold-Start)

Imagine teaching a student not just the answer, but how to solve the problem step-by-step.

  • The Trick: Instead of showing the AI thousands of examples with just the final answer, the researchers used a super-smart AI (GPT-4o) to write out the entire thought process for a few thousand examples.
  • The Steps: The AI learns to think like this:
    1. Look at the picture: "That's a basketball player."
    2. Read the text: "It mentions the Celtics and the Heat."
    3. Connect the dots: "The player is wearing a Celtics jersey."
    4. Check the rules: "Celtics and Heat are rivals."
    5. Conclusion: "Therefore, the relationship is 'Opponents'."
  • Result: The AI learns the habit of reasoning. It stops guessing and starts analyzing.
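To make the "textbook" idea concrete, here is a minimal sketch of how a cold-start training example might be assembled, with the reasoning chain baked into the target. The field names and prompt wording are my assumptions for illustration, not the paper's exact format:

```python
# Sketch only: field names and prompt phrasing are hypothetical,
# not the paper's actual data schema.
def build_sft_example(image_path, caption, obj, entity, reasoning_steps, label):
    prompt = (
        f'Given the image and the caption "{caption}", what is the relation '
        f"between the object '{obj}' and the entity '{entity}'? Think step by step."
    )
    # The target teaches the *process*, not just the final answer.
    target = "\n".join(f"Step {i}: {s}" for i, s in enumerate(reasoning_steps, 1))
    target += f"\nAnswer: {label}"
    return {"image": image_path, "prompt": prompt, "target": target}

example = build_sft_example(
    "game.jpg",
    "After 4 Lopsided Games, Heat and Celtics Are Tied Up.",
    "player in green jersey",
    "Heat",
    [
        "The image shows a basketball player in a Celtics jersey.",
        "The caption mentions a Heat vs. Celtics series.",
        "The player's team is therefore playing against the Heat.",
    ],
    "Opponents",
)
```

Fine-tuning on targets like this is what turns "guess the label" into "write out the detective work first."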

Stage 2: The "Hard Mode" Boot Camp (Reinforcement Learning)

Now that the AI knows how to think, it needs to get better at the toughest cases.

  • The Problem: If you only practice on easy questions, you won't improve. If you only practice on impossible questions, you might get frustrated and give up.
  • The Creative Fix (Progressive Sample-Mixing): The researchers created a special training schedule.
    • Early days: The AI gets a mix of easy and hard questions (50/50). This keeps it confident.
    • Later days: As the AI gets smarter, they slowly swap out the easy questions for harder ones.
    • The Goal: It's like a video game that gets slightly harder every level. The AI is forced to stretch its brain to solve the difficult puzzles without getting overwhelmed.
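The schedule above can be sketched in a few lines. This is a toy version under my own assumptions (a linear decay of the easy fraction; the paper's exact ratios and schedule may differ):

```python
# Toy progressive sample-mixing schedule (ratios and decay shape are assumptions).
import random

def mix_ratio(step, total_steps, start_easy=0.5, end_easy=0.0):
    """Fraction of easy samples in a batch, decaying linearly over training."""
    frac = step / max(total_steps - 1, 1)
    return start_easy + (end_easy - start_easy) * frac

def sample_batch(easy_pool, hard_pool, batch_size, step, total_steps):
    """Draw a batch whose easy/hard mix depends on the current training step."""
    n_easy = round(mix_ratio(step, total_steps) * batch_size)
    batch = random.sample(easy_pool, n_easy) + random.sample(
        hard_pool, batch_size - n_easy
    )
    random.shuffle(batch)
    return batch
```

Early on (`step=0`) half the batch is easy; by the final step the batch is all hard questions, which is exactly the "video game that gets harder every level" effect.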

3. The Reward System: The "Gold Star"

How does the AI know it's doing a good job? It gets a "score" (Reward) based on three things:

  1. Format: Did it follow the steps? (Did it write out its thoughts?)
  2. Effort: Did it think long enough? (No lazy, one-sentence answers!)
  3. Accuracy: Did it get the final answer right?

If the AI gets all three, it gets a "Gold Star" and learns to do it again. If it fails, it tries a different path next time.
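A toy version of that three-part score might look like the following. The `<think>`/`<answer>` tag format, the word-count threshold, and the weights are all my assumptions; the paper defines its own reward terms:

```python
# Toy composite reward: format + effort + accuracy.
# Tag format, threshold, and weights are assumptions for illustration.
import re

def reward(response, gold_label, min_think_words=30):
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", response, re.S)
    if not m:
        return 0.0                       # no visible reasoning: no credit at all
    think, answer = m.group(1), m.group(2)
    r = 0.2                              # format: followed the required structure
    if len(think.split()) >= min_think_words:
        r += 0.2                         # effort: reasoning is long enough
    if answer.strip().lower() == gold_label.lower():
        r += 0.6                         # accuracy: final answer is correct
    return r
```

In GRPO-style training, scores like this are compared across a group of sampled responses to the same question, so the model learns to prefer the reasoning paths that earn the "Gold Star."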

4. The Result: Why It Matters

When they tested MORE-R1 on the "Mystery" (the MORE benchmark), it crushed the competition.

  • Old AI: "I think they are partners." (Wrong)
  • MORE-R1: "Step 1: Player is Celtics. Step 2: Text says Heat vs. Celtics. Step 3: They are rivals. Answer: Opponents." (Correct!)

The Big Picture

MORE-R1 is a breakthrough because it teaches AI to show its work. Just like in school, when you show your math steps, the teacher can see if you understand the logic, not just if you got the right number.

By combining step-by-step thinking with a smart training schedule that gradually increases difficulty, this model can solve complex visual puzzles that previous AI models found impossible. It's the difference between a parrot that repeats words and a detective who actually solves the case.