Imagine you have a new, incredibly smart robot doctor. You show it a picture of a patient's heartbeat (an ECG), and it says, "This patient has an irregular heartbeat called Atrial Fibrillation."
But here's the catch: How do you know whether it's actually looking at the picture, or just guessing and making up a story to sound smart?
This is the problem the paper "How Well Do Multimodal Models Reason on ECG Signals?" tries to solve. The authors built a "lie detector" test for AI doctors to see if they are truly reasoning or just hallucinating.
Here is the simple breakdown of their solution, using some everyday analogies:
The Problem: The "Magic Trick" of AI
Current AI models are great at giving the right answer (like a magician pulling a rabbit out of a hat), but they are terrible at explaining how they did it.
- The Old Way: To check if the AI was telling the truth, you had to hire a real human doctor to read the AI's explanation. This is slow, expensive, and you can't do it for every single patient.
- The New Way: The authors created a system called ECG ReasonEval that acts like a two-part referee game.
The Solution: The Two-Part Referee
The authors realized that "reasoning" is actually two different skills mixed together. They split the test into two distinct parts: Perception and Deduction.
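This two-skill split can be pictured as a grader that gives every step in the model's reasoning two separate verdicts. The names and structure below are an illustrative sketch, not the paper's actual API:

```python
# Illustrative sketch: each step in a model's reasoning trace gets a
# Perception verdict (did the signal actually show this?) and a
# Deduction verdict (does the inference follow medical rules?).
from dataclasses import dataclass

@dataclass
class StepVerdict:
    claim: str
    perception_ok: bool   # the claim matches the raw ECG data
    deduction_ok: bool    # the inference is supported by guidelines

def grade(trace):
    """A trace is trustworthy only if every step passes both checks."""
    return all(v.perception_ok and v.deduction_ok for v in trace)

trace = [
    StepVerdict("rhythm is irregularly irregular", True, True),
    StepVerdict("therefore this is atrial fibrillation", True, True),
]
print(grade(trace))  # True
```

The key design idea is that one bad verdict of either kind sinks the whole trace: a correct diagnosis built on a hallucinated observation still fails.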
1. Perception: "Do you actually see what's on the paper?"
- The Analogy: Imagine a student taking a math test. The teacher asks, "What is the number written on the board?"
- If the student says, "It's a 7," but the board clearly says "1," they failed Perception. They didn't actually look at the data.
- How the AI Test Works:
- The AI says, "I see the heartbeat is skipping beats."
- The Perception Agent (a robot coder) instantly writes a tiny computer program to check the raw heartbeat data.
- It counts the beats. If the data doesn't show skipping beats, the AI gets a red flag: "You are hallucinating! You didn't see what you said you saw."
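The kind of "tiny computer program" such an agent might generate could look like the sketch below: measure the gaps between beats and flag the rhythm as irregular if any gap deviates too much from the average. The R-peak times and the 20% threshold are illustrative assumptions, not values from the paper:

```python
# Minimal sketch of a generated Perception check: given R-peak times
# (in seconds), decide whether the rhythm really is "skipping beats".
def is_rhythm_irregular(r_peaks, tolerance=0.2):
    """Return True if any beat-to-beat interval deviates from the
    mean interval by more than `tolerance` (as a fraction)."""
    intervals = [b - a for a, b in zip(r_peaks, r_peaks[1:])]
    mean_rr = sum(intervals) / len(intervals)
    return any(abs(rr - mean_rr) / mean_rr > tolerance for rr in intervals)

# A steady rhythm, one beat every 0.8 s: a claim of "skipping beats"
# here would be flagged as a hallucination.
steady = [0.0, 0.8, 1.6, 2.4, 3.2]
print(is_rhythm_irregular(steady))   # False

# A rhythm with a dropped beat: the 1.6 s gap stands out.
skipped = [0.0, 0.8, 2.4, 3.2, 4.0]
print(is_rhythm_irregular(skipped))  # True
```

In other words, the AI's words are checked against the numbers, not against another AI's opinion.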
2. Deduction: "Does your logic make sense to a doctor?"
- The Analogy: Now imagine the student sees the number "7" correctly. They then say, "Because it is a 7, the answer to the problem is 'Blue'."
- They saw the number right (good Perception), but their logic is nonsense (bad Deduction).
- How the AI Test Works:
- The AI says, "The heartbeat is skipping, so the patient has Atrial Fibrillation."
- The Deduction Agent takes that sentence and searches a massive library of medical textbooks and guidelines.
- It asks: "Do medical experts agree that skipping beats always mean Atrial Fibrillation?"
- If the AI's logic matches the textbooks, it gets a green light. If it makes up a weird connection that no doctor would ever make, it gets a red flag.
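At its simplest, that guideline lookup amounts to asking whether the finding-to-diagnosis step appears in a trusted knowledge base. The toy rules below are illustrative placeholders, not the paper's actual guideline corpus:

```python
# Toy sketch of the Deduction check: a claimed inference is a green
# light only if the knowledge base lists the diagnosis as one that
# the stated finding can support.
KNOWLEDGE_BASE = {
    # finding -> diagnoses that textbook rules say it can support
    "irregularly irregular rhythm": {"atrial fibrillation"},
    "st elevation": {"myocardial infarction"},
}

def check_deduction(finding, diagnosis):
    """Return True (green light) if the finding->diagnosis step is
    supported by the knowledge base; False (red flag) otherwise."""
    supported = KNOWLEDGE_BASE.get(finding.lower(), set())
    return diagnosis.lower() in supported

print(check_deduction("Irregularly irregular rhythm", "Atrial Fibrillation"))  # True
print(check_deduction("ST elevation", "Atrial Fibrillation"))                  # False
```

The real system searches prose guidelines rather than a hand-written dictionary, but the pass/fail question is the same: is this inference one a doctor would actually make?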
What Did They Find? (The Plot Twist)
The authors tested several "smart" AI models (including big names like Claude and Gemini) and found some surprising results:
The "Fake It Till You Make It" AI: Some models were great at Deduction (they knew the medical rules) but terrible at Perception.
- The Metaphor: These models are like a student who memorized the answer key but never looked at the test questions. They guessed the right disease, then made up a fake reason to justify it. They said, "I see deep Q-waves," when the picture clearly showed none. This is dangerous because it sounds confident but is lying about the evidence.
The "Dull Sensor" AI: Other models were great at Perception but terrible at Deduction.
- The Metaphor: These are like a security camera that sees everything clearly but has no brain. They can say, "I see a weird wave," but they don't know what it means. They can't connect the dots to give a diagnosis.
The Best (But Still Imperfect) Candidate: The newest models (like Gemini) were the best at balancing both, but they still weren't as good as a human doctor. They are getting closer to being "trustworthy," but they aren't ready to replace doctors yet.
Why Does This Matter?
In medicine, you can't just trust an AI because it got the right answer. If an AI says, "The patient is fine," but it hallucinated that the heartbeat was normal when it wasn't, the patient could die.
This paper gives us a way to audit AI doctors without needing a human to read every single report. It separates seeing from thinking, ensuring that when an AI gives a diagnosis, it's actually looking at the patient and using real medical logic, not just making up a story.
In short: The paper teaches us how to stop trusting the AI's "confidence" and start checking its "eyes" and its "brain" separately.