Accuracy and efficiency of using artificial intelligence for data extraction in systematic reviews: a noninferiority study within reviews

This noninferiority study demonstrates that AI-assisted data extraction using Elicit® is as accurate as, but significantly faster and more cost-effective than, human-only extraction for systematic reviews of RCTs, suggesting it can effectively replace one human extractor without compromising data quality.

Lee, D. C. W., O'Brien, K. M., Presseau, J., Yoong, S., Lecathelinais, C., Wolfenden, L., Thomas, J., Arno, A., Hutton, B., Hodder, R. K.

Published 2026-02-27

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build a giant, perfect puzzle of the world's health policies. To do this, you need to read hundreds of scientific studies and pull out specific facts (like "how many people were in the study" or "what medicine they took"). This process is called data extraction.

Traditionally, this job is done by humans. It's like having a team of scribes sitting in a library, reading every single book, and copying the important bits into a ledger. It takes forever, it's boring, and even the best scribes make mistakes when they get tired.

This paper asks a simple question: Can we hire a super-smart robot assistant to help the scribes, or even replace one of them, without ruining the quality of the puzzle?

Here is the story of their experiment, explained simply.

The Experiment: Humans vs. Humans + Robot

The researchers set up a race between two teams to extract data from 50 different scientific studies about children's health.

  1. Team Human: Two experienced researchers read the studies and typed the data into a spreadsheet using only their eyes and brains.
  2. Team Hybrid: Two researchers did the same job, but they had a secret weapon: an AI tool called Elicit®. This tool acts like a super-fast librarian. You ask it a question (e.g., "What was the age of the participants?"), and it scans the document, finds the answer, and even shows you the exact sentence where it found it. The human then checks the answer and types it in.

The Results: Who Won?

The researchers measured three things: Accuracy (did they get the facts right?), Speed (how long did it take?), and Cost (how much did it cost to run the race?).

1. Accuracy: The "Perfect Score" Test

Imagine a teacher grading a test.

  • The Result: Both teams got almost the exact same score. The AI-assisted team didn't make more mistakes than the humans. In fact, for one specific part of the puzzle (describing the "intervention" or the treatment), the AI team was actually better at getting the details right.
  • The Metaphor: Think of the AI as a spell-checker that never gets tired. It didn't introduce new typos; it just helped the human catch the ones they might have missed.

2. Speed: The "Fast Food" vs. "Slow Cook"

  • The Result: The AI-assisted team was much faster. On average, they finished each study about 25 minutes sooner than the human-only team.
  • The Metaphor: If the human-only team was walking to the grocery store, the AI-assisted team was driving a sports car. They didn't just walk faster; they skipped the traffic jams. Over 50 studies, this saved the team over 20 hours of work (25 minutes × 50 studies ≈ 21 hours). That's like getting back more than two full working days!

3. Errors: The "Hallucination" Myth

People often worry that AI will "hallucinate"—meaning it will make up facts that aren't true (like saying a study happened in 1990 when it actually happened in 2020).

  • The Result: The AI made up facts just as rarely as the humans did. Both teams made similar types of mistakes, mostly small ones like missing a tiny detail or rounding a number slightly wrong.
  • The Metaphor: The AI wasn't a crazy dreamer making up stories; it was a diligent intern who occasionally forgot to write down a phone number, just like a human would.

4. Cost: The "Wallet" Check

  • The Result: The AI-assisted method was actually cheaper. Even though the AI tool requires a paid subscription, the time saved meant less money was spent on researcher salaries (the sketch after this list walks through the arithmetic).
  • The Metaphor: It's like buying a slightly expensive coffee machine. It costs more upfront than a kettle, but because it makes coffee so fast, you save money on the barista's wages in the long run.
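
For readers who like to see the arithmetic, here is a back-of-the-envelope version of that break-even logic. Every figure except the roughly 20 hours saved is made up for illustration; the preprint reports the actual costs.

```python
# Back-of-the-envelope cost comparison. All figures except hours_saved
# are HYPOTHETICAL placeholders; the preprint reports the actual costs.
hourly_wage = 50.0    # hypothetical researcher cost per hour (USD)
hours_saved = 20.0    # approximate time saved across 50 studies (speed result)
subscription = 500.0  # hypothetical AI-tool subscription cost for the project

net_saving = hours_saved * hourly_wage - subscription
print(f"Net saving: ${net_saving:,.0f}")  # positive means AI-assisted is cheaper
```

With these made-up numbers the AI-assisted workflow comes out $500 ahead. The direction matches the paper's finding; the magnitude depends entirely on real wages and subscription prices.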

The Big Takeaway

This paper is a "non-inferiority" study. In plain English, that means the researchers weren't trying to prove the AI was better than humans; they just wanted to prove it wasn't worse (there's a small sketch of this logic after the verdict below).

The verdict? The AI assistant is non-inferior. It is just as accurate, but much faster and cheaper.
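
For the statistically curious, here is a minimal sketch of how a non-inferiority check works. All the numbers below are made up; the preprint defines its own margin and reports the real accuracy figures. The idea is to estimate the difference in accuracy and check that even the pessimistic end of its confidence interval stays above an agreed "no worse than this" margin.

```python
import math

# Non-inferiority check with HYPOTHETICAL numbers; the preprint's actual
# accuracy estimates and margin are not reproduced here.
n_items = 1000     # hypothetical number of extracted data items per arm
acc_human = 0.90   # hypothetical accuracy, human-only extraction
acc_ai = 0.89      # hypothetical accuracy, AI-assisted extraction
margin = 0.05      # hypothetical non-inferiority margin (5 percentage points)

# Difference in accuracy (AI-assisted minus human-only) and a
# normal-approximation 95% confidence interval for that difference.
diff = acc_ai - acc_human
se = math.sqrt(acc_ai * (1 - acc_ai) / n_items
               + acc_human * (1 - acc_human) / n_items)
ci_lower = diff - 1.96 * se

# Non-inferiority holds if the whole interval stays above -margin:
# even in the worst plausible case, the AI-assisted arm is less than
# `margin` worse than the human-only arm.
print(f"difference = {diff:+.3f}, 95% CI lower bound = {ci_lower:+.3f}")
print("non-inferior" if ci_lower > -margin else "inconclusive or inferior")
```

With these placeholder numbers, the worst plausible case is about 3.7 percentage points worse than human-only extraction, which is still inside the 5-point margin, so the check passes. That is the shape of the argument the paper makes with its real data.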

Why This Matters

Systematic reviews are the "gold standard" for making health decisions. If we can use AI to do the boring, time-consuming data gathering, human experts can stop acting like data-entry clerks and start doing what they do best: thinking, analyzing, and making sense of the big picture.

The researchers suggest that in the future, we might see a workflow where one human and one AI work together as a team, or where two AIs check each other's work, with a human just stepping in to double-check the final result. It's not about replacing humans; it's about giving them a superpower.
