Automation of Systematic Reviews with Large Language Models

This study validates "otto-SR," a large language model-based workflow that demonstrates high performance in automating article screening, data extraction, and risk of bias assessment, thereby enabling the rapid and reliable reproduction and updating of systematic reviews.

Cao, C., Arora, R., Cento, P., Budak, A., Manta, K., Farahani, E., Cecere, M., Selemon, A., Sang, J., Gong, L. X., Kloosterman, R., Jiang, S., Saleh, R., Margalik, D., Lin, J., Jomy, J., Xie, J., Chen, D., Gorla, J., Lee, S., Zhang, K., Kuang, J., Ware, H., Whelan, M. G., Teja, B., Leung, A. A., Arora, R. K., Pillay, J., Hartling, L., Detsky, A., Noetel, M., Emerson, D. B., Tricco, A. C., Church, G. M., Moher, D., Bobrovitz, N.

Published 2026-02-18

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a massive puzzle, but instead of having 100 pieces, you have 146,000 pieces scattered all over the floor. This is what doing a "Systematic Review" feels like for scientists. They need to find every single piece that fits a specific picture (e.g., "Does this drug cure headaches?"), ignore the junk, and then carefully assemble the valid pieces to see the final image.

Usually, this process is like trying to build that puzzle by hand in a dark room. It often takes over a year of exhausting work, and because humans get tired, they sometimes miss pieces or put them in the wrong spots.

Enter "otto-SR": The Super-Powered Puzzle Assistant.

This paper introduces a new tool called otto-SR, which uses a "Large Language Model" (think of it as a super-smart, tireless robot librarian) to do the heavy lifting. The researchers wanted to see if this robot could do three of the hardest jobs faster and better than human experts, and then whether it could use those skills to bring existing reviews up to date.

Here is how they tested it, using simple analogies:

1. The Great Filter (Article Screening)

The Job: Imagine you have a stack of 32,000 letters. You need to throw away the spam and keep only the important ones.
The Test: The robot looked at 32,357 citations.
The Result: The robot was amazing. It caught 96.7% of the important letters, while the human team only caught 81.7%. The robot was less likely to accidentally throw away a letter that actually mattered. It was like having a filter that never gets tired or distracted.
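To make that "caught 96.7%" figure concrete, here is a minimal sketch of how a catch rate (usually called sensitivity) is worked out. The counts are made up for illustration and are not taken from the paper, and the function name is just an assumption for this example.

```python
def sensitivity(kept_relevant: int, missed_relevant: int) -> float:
    """Fraction of the truly relevant citations that the screener kept."""
    total_relevant = kept_relevant + missed_relevant
    if total_relevant == 0:
        raise ValueError("need at least one relevant citation")
    return kept_relevant / total_relevant


# Hypothetical counts, NOT from the paper: 30 relevant citations, 1 thrown away.
print(f"{sensitivity(kept_relevant=29, missed_relevant=1):.1%}")  # prints 96.7%
```

Note that sensitivity only counts the important letters. It says nothing about how much spam was kept by mistake, which is a separate measure.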

2. The Data Detective (Data Extraction)

The Job: Once you have the important letters, you need to read them and pull out specific numbers (like "How many people got better?").
The Test: The robot had to pull out nearly 4,500 data points from hundreds of studies.
The Result: The robot got the answers right 93.1% of the time. The humans got it right about 80% of the time. The robot was harder to trip up with messy tables or complicated charts, although it still got roughly 7 answers in 100 wrong.

3. The Quality Inspector (Risk of Bias)

The Job: You need to check if the studies are trustworthy or if they were rigged.
The Test: The robot judged the quality of 345 studies.
The Result: When two robots (or a robot and a human) looked at the same study, they agreed almost perfectly. It was like having two inspectors who always shake hands on the verdict, whereas humans might argue more often.
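This summary does not say which agreement score the researchers used. One common choice for this kind of check is Cohen's kappa, which measures how often two inspectors agree beyond what luck alone would produce. The sketch below assumes that statistic; the ratings are invented and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same non-empty set of studies")
    n = len(rater_a)
    # How often the two raters actually gave the same verdict.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # How often they would match by chance, given how often each uses each label.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented risk-of-bias verdicts for six studies (not data from the paper).
inspector_1 = ["low", "low", "high", "some", "low", "high"]
inspector_2 = ["low", "low", "high", "low", "low", "high"]
print(round(cohens_kappa(inspector_1, inspector_2), 2))  # prints 0.7
```

A kappa of 1 means the inspectors always agree, and 0 means they agree no more often than chance would predict.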

4. The Time Traveler (Updating Reviews)

The Job: Science moves fast. A review done two years ago might be outdated today.
The Test: The robot took a set of famous reviews (Cochrane reviews) and tried to redo them quickly, adding all the new research that came out since then.
The Result: The robot didn't just copy the old work; it found nearly twice as many new, relevant studies as the original human authors did! Because it found more pieces, the final picture changed. In some cases, the robot's new picture showed a treatment did work (the result became statistically significant), while the old picture said it didn't. In one case, the reverse happened: the new picture showed a treatment didn't work, changing the conclusion entirely.
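Why would finding more studies change the verdict? Here is a rough sketch using one standard way of combining studies (fixed-effect, inverse-variance pooling). That method is an assumption for illustration, not necessarily what the reviews used, and every number below is invented.

```python
import math


def pooled_estimate(effects: list[float], std_errors: list[float]) -> tuple[float, float, float]:
    """Return (pooled effect, 95% CI lower bound, 95% CI upper bound)."""
    # More precise studies (smaller standard error) get more weight.
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se


def is_significant(low: float, high: float) -> bool:
    """Significant at the 5% level if the interval excludes 0 (no effect)."""
    return not (low <= 0 <= high)


# Invented log risk ratios and standard errors (negative = treatment helps).
old_effects, old_errors = [-0.20, -0.10], [0.15, 0.20]
new_effects, new_errors = old_effects + [-0.25, -0.15], old_errors + [0.12, 0.18]

for name, eff, err in [("original", old_effects, old_errors), ("updated", new_effects, new_errors)]:
    pooled, low, high = pooled_estimate(eff, err)
    print(f"{name}: {pooled:.3f} ({low:.3f} to {high:.3f}) significant={is_significant(low, high)}")
```

With these made-up numbers, the two original studies give an interval that still includes "no effect," while adding two more studies narrows the interval enough to exclude it. The same mechanism can work in the other direction if the new studies point the opposite way.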

The Big Picture

The researchers concluded that this AI tool isn't just a helper; it's a game-changer.

Think of it this way: Human researchers are like master chefs. They are brilliant, but they can only cook so many meals a day before they get tired. otto-SR is like a high-tech, automated kitchen. It can chop, mix, and taste thousands of ingredients in minutes without getting tired.

By using this robot, we may be able to stop waiting a year for answers. We could get reliable, up-to-date medical evidence far sooner, helping doctors and patients have the best, most current information to make life-saving decisions. The future of medical research isn't just about humans working harder; it's about humans working smarter with a tireless AI partner.
