MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

The paper introduces MORE-R1, a Large Vision-Language Model trained in two stages: Supervised Fine-Tuning on automatically constructed stepwise reasoning data, followed by Reinforcement Learning with Group Relative Policy Optimization. The result is state-of-the-art performance on Multimodal Object-Entity Relation Extraction.

Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo

Published Wed, 11 Ma

Imagine you are a detective trying to solve a mystery, but you have two very different clues: a photograph and a newspaper headline. Your job is to figure out the relationship between a specific person in the photo and a specific word in the headline.

For example:

  • Photo: A basketball player wearing a green jersey with the number 0.
  • Headline: "After 4 Lopsided Games, Heat and Celtics Are Tied Up."
  • The Mystery: What is the relationship between the player in the photo and the word "Heat"?

Most computer programs today are like detectives who just guess the answer based on a quick glance. They might say, "Oh, they look like partners!" because they see two sports teams. But they miss the nuance: the player is on the Celtics team, and the headline says the Celtics are playing against the Heat. The real answer is "Opponents."

This paper introduces MORE-R1, a new "Super Detective" that doesn't just guess. It thinks before it speaks.

Here is how it works, broken down into simple steps:

1. The Problem: The "Guessing Game" is Broken

Old methods are like students who memorized a list of 21 possible answers (like "Partner," "Opponent," "Leader," etc.). If the answer isn't on their list, they fail. If the situation is tricky, they get confused. They also can't explain why they chose an answer, making them unreliable.

2. The Solution: A Two-Stage Training Camp

The authors built MORE-R1 using a powerful AI brain (called an LVLM) and trained it in two distinct phases, like a sports team preparing for the big game.

Stage 1: The "Textbook" Learning (Cold-Start)

Imagine teaching a student not just the answer, but how to solve the problem step-by-step.

  • The Trick: Instead of showing the AI thousands of examples with just the final answer, the researchers used a super-smart AI (GPT-4o) to write out the entire thought process for a few thousand examples.
  • The Steps: The AI learns to think like this:
    1. Look at the picture: "That's a basketball player."
    2. Read the text: "It mentions the Celtics and the Heat."
    3. Connect the dots: "The player is wearing a Celtics jersey."
    4. Check the rules: "Celtics and Heat are rivals."
    5. Conclusion: "Therefore, the relationship is 'Opponents'."
  • Result: The AI learns the habit of reasoning. It stops guessing and starts analyzing.
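To make the "textbook" idea concrete, here is a minimal sketch of how a cold-start training example might be assembled, with the reasoning chain baked into the target. The field names and prompt wording are my assumptions for illustration, not the paper's exact format:

```python
# Sketch only: field names and prompt phrasing are hypothetical,
# not the paper's actual data schema.
def build_sft_example(image_path, caption, obj, entity, reasoning_steps, label):
    prompt = (
        f'Given the image and the caption "{caption}", what is the relation '
        f"between the object '{obj}' and the entity '{entity}'? Think step by step."
    )
    # The target teaches the *process*, not just the final answer.
    target = "\n".join(f"Step {i}: {s}" for i, s in enumerate(reasoning_steps, 1))
    target += f"\nAnswer: {label}"
    return {"image": image_path, "prompt": prompt, "target": target}

example = build_sft_example(
    "game.jpg",
    "After 4 Lopsided Games, Heat and Celtics Are Tied Up.",
    "player in green jersey",
    "Heat",
    [
        "The image shows a basketball player in a Celtics jersey.",
        "The caption mentions a Heat vs. Celtics series.",
        "The player's team is therefore playing against the Heat.",
    ],
    "Opponents",
)
```

Fine-tuning on targets like this is what turns "guess the label" into "write out the detective work first."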

Stage 2: The "Hard Mode" Boot Camp (Reinforcement Learning)

Now that the AI knows how to think, it needs to get better at the toughest cases.

  • The Problem: If you only practice on easy questions, you won't improve. If you only practice on impossible questions, you might get frustrated and give up.
  • The Creative Fix (Progressive Sample-Mixing): The researchers created a special training schedule.
    • Early days: The AI gets a mix of easy and hard questions (50/50). This keeps it confident.
    • Later days: As the AI gets smarter, they slowly swap out the easy questions for harder ones.
    • The Goal: It's like a video game that gets slightly harder every level. The AI is forced to stretch its brain to solve the difficult puzzles without getting overwhelmed.
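The schedule above can be sketched in a few lines. This is a toy version under my own assumptions (a linear decay of the easy fraction; the paper's exact ratios and schedule may differ):

```python
# Toy progressive sample-mixing schedule (ratios and decay shape are assumptions).
import random

def mix_ratio(step, total_steps, start_easy=0.5, end_easy=0.0):
    """Fraction of easy samples in a batch, decaying linearly over training."""
    frac = step / max(total_steps - 1, 1)
    return start_easy + (end_easy - start_easy) * frac

def sample_batch(easy_pool, hard_pool, batch_size, step, total_steps):
    """Draw a batch whose easy/hard mix depends on the current training step."""
    n_easy = round(mix_ratio(step, total_steps) * batch_size)
    batch = random.sample(easy_pool, n_easy) + random.sample(
        hard_pool, batch_size - n_easy
    )
    random.shuffle(batch)
    return batch
```

Early on (`step=0`) half the batch is easy; by the final step the batch is all hard questions, which is exactly the "video game that gets harder every level" effect.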

3. The Reward System: The "Gold Star"

How does the AI know it's doing a good job? It gets a "score" (Reward) based on three things:

  1. Format: Did it follow the steps? (Did it write out its thoughts?)
  2. Effort: Did it think long enough? (No lazy, one-sentence answers!)
  3. Accuracy: Did it get the final answer right?

If the AI gets all three, it gets a "Gold Star" and learns to do it again. If it fails, it tries a different path next time.
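A toy version of that three-part score might look like the following. The `<think>`/`<answer>` tag format, the word-count threshold, and the weights are all my assumptions; the paper defines its own reward terms:

```python
# Toy composite reward: format + effort + accuracy.
# Tag format, threshold, and weights are assumptions for illustration.
import re

def reward(response, gold_label, min_think_words=30):
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", response, re.S)
    if not m:
        return 0.0                       # no visible reasoning: no credit at all
    think, answer = m.group(1), m.group(2)
    r = 0.2                              # format: followed the required structure
    if len(think.split()) >= min_think_words:
        r += 0.2                         # effort: reasoning is long enough
    if answer.strip().lower() == gold_label.lower():
        r += 0.6                         # accuracy: final answer is correct
    return r
```

In GRPO-style training, scores like this are compared across a group of sampled responses to the same question, so the model learns to prefer the reasoning paths that earn the "Gold Star."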

4. The Result: Why It Matters

When they tested MORE-R1 on the "Mystery" (the MORE benchmark), it crushed the competition.

  • Old AI: "I think they are partners." (Wrong)
  • MORE-R1: "Step 1: Player is Celtics. Step 2: Text says Heat vs. Celtics. Step 3: They are rivals. Answer: Opponents." (Correct!)

The Big Picture

MORE-R1 is a breakthrough because it teaches AI to show its work. Just like in school, when you show your math steps, the teacher can see if you understand the logic, not just if you got the right number.

By combining step-by-step thinking with a smart training schedule that gradually increases difficulty, this model can solve complex visual puzzles that previous AI models found impossible. It's the difference between a parrot that repeats words and a detective who actually solves the case.