Imagine you are a hiring manager trying to decide which candidates are eligible for a very specific job. You have a stack of 200 resumes (clinical trial abstracts), and you need to know: Can this person work in a local office, a remote office, or both?
In the past, you might have asked a super-smart AI assistant to just give you a list of "Yes" or "No" answers. But here's the problem: AI is like a confident student who sometimes guesses the right answer but has no idea why. If you ask, "How did you know?" it might just make up a reason or stare blankly. In medicine, where lives are at stake, a confident guess isn't good enough. You need proof.
This paper is about a new experiment: What happens if we force the AI to "show its work" by pointing to the exact sentence in the resume that proves its answer?
The Experiment: The "Show Your Work" Test
The researchers took three of the world's most advanced AI models (think of them as three different super-intelligent geniuses: one from OpenAI, one from Google, and one from Anthropic). They gave them the same 200 medical trial summaries.
They ran the test in two ways:
- The "Just Give Me the Answer" Mode: The AI just says "Local," "Remote," or "Both."
- The "Show Your Work" Mode: The AI must say "Local" AND highlight the exact sentence in the text that proves it.
Crucially, the AI wasn't allowed to summarize or paraphrase. It had to copy-paste the exact words from the text, like a student underlining a sentence in a textbook to prove they read it.
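To make the two modes concrete, here is a minimal sketch of what the prompts might look like. The wording and the JSON field names (`answer`, `evidence`) are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative prompts for the two modes. The exact wording is an assumption,
# not the paper's actual prompt text.

ANSWER_ONLY_PROMPT = """Read the clinical trial abstract below.
Reply with exactly one word: Local, Remote, or Both.

Abstract:
{abstract}"""

SHOW_YOUR_WORK_PROMPT = """Read the clinical trial abstract below.
Reply in JSON with two fields:
  "answer":   "Local", "Remote", or "Both" ("Unknown" if the text does not say)
  "evidence": the exact sentence, copied verbatim from the abstract, that proves the answer

Do not paraphrase or summarize the evidence. Copy it character for character.

Abstract:
{abstract}"""
```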
What They Found: The Good, The Bad, and The "Wait, Really?"
Here is the breakdown of what happened, using some simple analogies:
1. The "Honesty" Trade-off (Coverage vs. Accuracy)
When the AI had to show its work, it became more honest.
- Before: The AI would confidently guess on everything, even if the resume was vague. It answered 98% of the time.
- After: When forced to find proof, the AI realized, "Hey, I can't find a sentence that proves this!" So, it stopped guessing. It said, "I don't know," more often.
- The Result: The AI answered fewer questions (coverage dropped), but the answers it did give were often more reliable. It's like a student who used to guess on every test question but now only answers the ones they are 100% sure of (see the sketch below for how this trade-off is measured).
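This trade-off is easy to measure. Here is a minimal sketch, assuming each model output is either an answer or `None` for "I don't know"; the numbers in the example are made up to illustrate the pattern, not the paper's results.

```python
def selective_metrics(predictions, gold_labels):
    """Compute coverage and selective accuracy.

    predictions: list of model answers, where None means "I don't know" (abstention)
    gold_labels: list of correct answers, same length
    """
    answered = [(p, g) for p, g in zip(predictions, gold_labels) if p is not None]
    coverage = len(answered) / len(predictions)  # how often the model answers at all
    accuracy = (sum(p == g for p, g in answered) / len(answered)) if answered else 0.0
    return coverage, accuracy

# Toy example: forcing evidence lowers coverage but raises accuracy on what remains.
before = ["Local", "Remote", "Both", "Local",  "Remote"]  # answers everything
after  = ["Local", None,     "Both", None,     "Remote"]  # abstains twice
gold   = ["Local", "Both",   "Both", "Remote", "Remote"]
print(selective_metrics(before, gold))  # (1.0, 0.6)
print(selective_metrics(after, gold))   # (0.6, 1.0)
```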
2. The "Copy-Paste" Glitch (Mechanical Validity)
The researchers checked if the AI actually copied the text correctly.
- The Good News: Most of the time, the AI did a great job, copying the sentence exactly (between 83% and 91% of the time).
- The Bad News: Sometimes the AI got sloppy. It might add a period that wasn't there or drop a word, like a student who underlined the right sentence but started or stopped the line in slightly the wrong place. The system caught these errors automatically (a sketch of that check follows below).
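That automatic check fits in a few lines. A sketch, assuming "mechanically valid" means the quote appears verbatim in the source text; the near-miss rule (ignoring punctuation, spacing, and case) is an illustration of the kind of sloppiness described above, not the paper's exact validator.

```python
import string

def _normalize(s: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(s.translate(str.maketrans("", "", string.punctuation)).split()).lower()

def check_quote(abstract: str, quote: str) -> str:
    """Classify an evidence quote as valid, near-miss, or invalid."""
    if quote in abstract:
        return "valid"       # exact verbatim match: mechanically valid
    if _normalize(quote) and _normalize(quote) in _normalize(abstract):
        return "near-miss"   # right sentence, sloppy copy (stray punctuation, casing)
    return "invalid"         # the quote does not appear in the text at all

abstract = "Participants must attend weekly visits at the clinic. Remote monitoring is not offered."
print(check_quote(abstract, "Remote monitoring is not offered."))   # valid
print(check_quote(abstract, "remote monitoring, is not offered."))  # near-miss
print(check_quote(abstract, "Patients may enroll from home."))      # invalid
```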
3. The "Confident but Wrong" Problem (Semantic Support)
This was the most interesting part. The researchers used a second AI to act as a "Teacher" to grade the first AI's work.
- The Scenario: The first AI said, "This candidate is eligible for Remote work," and pointed to a sentence.
- The Teacher's Verdict: The Teacher AI looked at the sentence and said, "Wait, that sentence doesn't actually prove they can work remotely. You're just guessing!"
- The Shock: Even when the AI copied the text perfectly, the quote failed to actually support the answer up to half the time. It was like a student copying a sentence about "math class" to prove they are good at "cooking." The text was real, but the logic was broken. (The sketch below turns this check into code.)
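The "Teacher" step can be scripted the same way. A sketch of the LLM-as-judge idea, where `call_llm` is a hypothetical helper (any prompt-in, text-out chat API would do) and the prompt wording is an assumption.

```python
JUDGE_PROMPT = """You are grading another model's work.

Claim: this clinical trial is "{answer}" (Local / Remote / Both).
Evidence the model quoted: "{quote}"

Does the quoted sentence, on its own, actually support the claim?
Reply with exactly one word: SUPPORTED or UNSUPPORTED."""

def semantically_supported(answer: str, quote: str, call_llm) -> bool:
    """Ask a second model whether the quote really proves the answer.

    call_llm is a hypothetical (prompt -> reply string) helper; plug in
    whichever LLM API you use as the judge.
    """
    reply = call_llm(JUDGE_PROMPT.format(answer=answer, quote=quote))
    return reply.strip().upper().startswith("SUPPORTED")
```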
4. The "Genius vs. The Artist" (Model Differences)
Not all AIs reacted the same way:
- Model A (GPT) and Model B (Gemini) actually got better at the task when forced to show their work. It was like they got focused and stopped guessing.
- Model C (Claude) got worse. It seemed to get confused by the extra rules and started making more mistakes. This shows that different AIs have different "personalities" and strengths.
The Big Takeaway: The "High-Trust" Filter
The main lesson of this paper is that forcing AI to show its work creates a "High-Trust Filter."
Imagine you are sorting mail.
- Without the filter: The AI sorts 100 letters a minute, but 20 of them are in the wrong pile.
- With the filter: The AI sorts 80 letters a minute. But for those 80, it attaches a sticky note saying, "I put this here because of line 4."
- The Magic Step: You then run a quick check on the sticky notes. If the note makes sense, you keep the letter. If the note is nonsense, you throw the letter into a "Human Review" pile.
By doing this, you end up with a smaller pile of letters, but almost all of them are perfectly sorted. You traded speed for safety.
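In code, the whole filter is just the pieces above chained together. A sketch reusing the hypothetical `check_quote` and `semantically_supported` helpers from the earlier sketches:

```python
def high_trust_filter(abstract, model_answer, model_quote, call_llm):
    """Accept an answer only if its evidence survives both checks;
    anything else goes to the "Human Review" pile."""
    if model_answer is None:
        return ("human_review", "model abstained")                      # no answer at all
    if check_quote(abstract, model_quote) != "valid":
        return ("human_review", "quote is not verbatim from the text")  # mechanical check failed
    if not semantically_supported(model_answer, model_quote, call_llm):
        return ("human_review", "quote does not support the answer")    # semantic check failed
    return ("accept", model_answer)                                     # high-trust answer
```

The design choice worth noting: every failure mode falls through to the same safe default, a human review, rather than a guess.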
Why This Matters for Medicine
In the real world, doctors can't just trust an AI to say, "This patient is eligible for this cancer trial." If the AI is wrong, a patient might get a treatment that doesn't work, or miss out on one that does.
This study suggests that the future of medical AI isn't just about making the AI smarter. It's about building systems that force the AI to prove its logic and then automatically checking if that proof holds up. If the AI can't show its work, or if the work doesn't make sense, the system should say, "Stop, a human needs to look at this."
It turns the AI from a "black box" that spits out answers into a "transparent assistant" that hands you the evidence, allowing humans to make the final, life-saving decision with confidence.