From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

The paper proposes MA-RAG, a multi-round agentic RAG framework that iteratively refines medical reasoning: semantic conflicts among candidate answers are turned into targeted evidence retrieval, and the reasoning traces are refined round by round, yielding substantial accuracy improvements over existing baselines across seven medical benchmarks.

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang

Published 2026-03-05

🩺 The Problem: The "Overconfident Doctor"

Imagine a brilliant medical student (the AI) who has read every textbook in the library. They are great at answering questions, but they have two big flaws:

  1. Hallucinations: Sometimes, they confidently make up facts that sound real but are wrong.
  2. Outdated Knowledge: Their textbooks are from 2020, but medicine changes fast. They don't know the latest guidelines.

Standard AI tools try to fix this by looking up answers in a library (Retrieval-Augmented Generation, or RAG). But most of these tools are like a student who asks one question, gets one book, and immediately writes an essay. If the first book they picked was slightly off-topic, they get the whole answer wrong. They don't know when to stop and ask for more help.

🚀 The Solution: The "Roundtable of Experts" (MA-RAG)

The authors propose a new system called MA-RAG. Instead of a single student writing an essay alone, imagine a medical roundtable where a team of agents works together in a loop until they agree on the perfect answer.

Think of it like a detective solving a mystery, but with three specific roles:

1. The Solver (The Brainstorming Team)

  • What they do: The AI generates several different possible answers to the medical question at once.
  • The Analogy: Imagine a group of detectives all writing down their own theory of the crime on a whiteboard.
  • The Magic: If all detectives write the same theory, they are probably right. But if they write different theories (e.g., "It was the butler" vs. "It was the gardener"), that's a Conflict. In MA-RAG, Conflict is good! It tells the system, "Hey, we don't know the answer yet. We need more info."
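The "do the theories agree?" check can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the sampled answers are short strings and treats any non-unanimous vote as a conflict (the function name `detect_conflict` is ours, not from the paper).

```python
from collections import Counter

def detect_conflict(answers: list[str]) -> bool:
    """Return True when the sampled answers disagree (no unanimous theory)."""
    counts = Counter(a.strip().lower() for a in answers)
    return len(counts) > 1

# Two detectives say "beta blockers", one says "ACE inhibitors": conflict.
answers = ["beta blockers", "Beta Blockers", "ACE inhibitors"]
print(detect_conflict(answers))  # -> True
```

A real system would compare full reasoning traces rather than exact strings, but the principle is the same: disagreement is the signal to go gather more evidence.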

2. The Retrieval Agent (The Investigator)

  • What they do: This agent looks at the conflicting theories and asks, "Where are we disagreeing?" It turns that disagreement into a specific search query to find the truth in a medical database.
  • The Analogy: The detectives realize they are arguing about the time of death. The Investigator doesn't just search "Who killed the victim?" (too vague). Instead, they search specifically for "Time of death in cases of poisoning." They go out and bring back new, specific evidence to settle the argument.
  • The Upgrade: Unlike older systems that search based on "how unsure the AI feels" (which is often wrong), this agent searches based on actual disagreement between the answers.
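To make the "disagreement becomes a search query" idea concrete, here is a deliberately simple sketch. The paper presumably uses an LLM to phrase the query; this toy version (the helper name `conflict_to_query` is ours) just splices the competing answers into the question so the retriever fetches evidence that discriminates between them.

```python
def conflict_to_query(question: str, answers: list[str]) -> str:
    """Turn the point of disagreement into a targeted retrieval query."""
    distinct = sorted(set(a.strip().lower() for a in answers))
    if len(distinct) <= 1:
        return question  # consensus: nothing specific to look up
    options = " vs. ".join(distinct)
    return f"{question} (evidence comparing {options})"

query = conflict_to_query(
    "First-line treatment for hypertension?",
    ["ACE inhibitors", "beta blockers"],
)
print(query)
```

The point of the upgrade described above is exactly this: the query is built from what the answers *disagree on*, not from a generic confidence score.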

3. The Ranking Agent (The Editor)

  • What they do: As the team gathers more evidence and writes more theories, the conversation gets long and messy. The Ranking Agent sorts the whiteboard. It puts the best, most accurate theories at the top and pushes the bad ones to the bottom.
  • The Analogy: Imagine a teacher grading a stack of essays. Instead of reading them in the order they were written, the teacher puts the "A+" essays at the top of the pile so the student can learn from the best examples first. This prevents the "Lost in the Middle" problem, where important clues get buried in a long chat history.
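Mechanically, the Ranking Agent's job is a reordering step. The sketch below assumes each piece of evidence already carries a numeric relevance score (in the actual system, a model would produce these); sorting the best material to the front is what counters the "Lost in the Middle" effect, since LLMs attend most reliably to the start of a long context.

```python
def rank_context(snippets: list[tuple[str, float]]) -> list[str]:
    """Reorder evidence so the highest-scoring snippets come first."""
    return [text for text, _ in sorted(snippets, key=lambda s: s[1], reverse=True)]

evidence = [("off-topic note", 0.1), ("key guideline", 0.9), ("related study", 0.5)]
print(rank_context(evidence))  # -> ['key guideline', 'related study', 'off-topic note']
```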

🔄 The Loop: From Conflict to Consensus

The system doesn't stop after one try. It runs in a loop:

  1. Generate: The team writes 5 different answers.
  2. Check: They see they disagree on a specific medical fact (The Conflict).
  3. Search: The Investigator finds the correct fact in the medical database.
  4. Refine: The team reads the new fact, throws out the wrong answers, and writes a new, better set of answers.
  5. Repeat: They keep doing this until everyone agrees (Consensus).
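The five steps above can be sketched as a single loop. This is our own schematic, not the paper's code: `solver` stands in for the answer-generating agent and `retriever` for the evidence-fetching agent, and a majority vote is used as a fallback if consensus never arrives.

```python
def ma_rag_loop(question, solver, retriever, max_rounds=3):
    """Generate -> check -> search -> refine, until the answers converge."""
    context: list[str] = []
    answers: list[str] = []
    for _ in range(max_rounds):
        answers = solver(question, context)      # 1. Generate several answers
        if len(set(answers)) == 1:               # 2. Check: consensus reached?
            return answers[0]                    # 5. Stop once everyone agrees
        context += retriever(question, answers)  # 3. Search on the disagreement
        # 4. Refine: next round re-generates with the new evidence in context
    return max(set(answers), key=answers.count)  # fallback: majority vote
```

For example, a stub solver that disagrees until evidence appears (`lambda q, ctx: ["A"] if ctx else ["A", "B"]`) converges to "A" after one retrieval round.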

This is like Boosting in machine learning, where each new model is trained to focus on the errors the previous ones made. In simple terms, it's like a coach telling a player, "You missed that shot. Here is the video of the mistake. Try again, but focus on fixing that specific error." The system iteratively fixes its own mistakes until the error is gone.

🏆 Why It Matters

The paper tested this on 7 different medical exams (from basic licensing to expert-level clinical reasoning).

  • The Result: MA-RAG beat all other methods, including standard AI and other "search-and-answer" tools.
  • The Gain: It improved accuracy by nearly 7 points on average. On the hardest exams, it improved by 37%.
  • The Takeaway: By letting the AI argue with itself, find its own mistakes, and fetch specific evidence to fix them, it becomes much safer and more reliable for real-world healthcare.

🌟 Summary Metaphor

If standard AI is a student taking a test alone, MA-RAG is a study group with a librarian.

  • The student (Solver) tries to answer.
  • The group notices they are arguing (Conflict).
  • The librarian (Retrieval Agent) runs to the shelves to find the exact page that settles the argument.
  • The group leader (Ranking Agent) organizes the notes so everyone learns from the best ideas.
  • They keep studying until they all agree on the correct answer.

This ensures that when the AI gives medical advice, it's not just guessing confidently; it's proven to be right through evidence and consensus.