Fine-Tuning A Large Language Model for Systematic Review Screening

This study demonstrates that fine-tuning a small, 1.2-billion-parameter open-weight LLM on over 8,500 human-rated titles and abstracts significantly outperforms both base models and prompting alone. The fine-tuned model achieves high agreement with human coders, making it an effective tool for automating title-and-abstract screening in large-scale systematic reviews.

Kweku Yamoah, Noah Schroeder, Emmanuel Dorley, Neha Rani, Caleb Schutz

Published 2026-03-27

Imagine you are a librarian trying to find the one perfect book for a very specific reading club. You have a massive warehouse with 8,500 books (titles and abstracts). Your job is to read the back cover of every single one to decide: "Keep this for the club" or "Throw it in the recycling bin."

Doing this by hand is exhausting. It could take you a year.

Recently, people tried using AI robots (Large Language Models) to do this sorting for them. But the robots were acting like confused tourists. If you asked them nicely, they sometimes got it right; if you asked them a slightly different way, they got it wrong. They were too "context-dependent"—they needed the perfect hint to work, and even then, they weren't reliable enough to trust with the whole job.

The Big Idea: "Teaching the Robot, Not Just Asking It"

The authors of this paper had a different idea. Instead of just asking the robot to sort the books, they decided to train a small, specific robot to be an expert on this specific reading club.

Think of it like this:

  • The Old Way (Prompting): You walk up to a smart but generic AI and say, "Please find books about AI in computer science." The AI guesses based on its general knowledge. It's okay, but it misses a lot.
  • The New Way (Fine-Tuning): You take a small, cheap robot and show it 371 examples of books you've already sorted. You say, "See this one? It's a 'Keep.' See this one? It's a 'Throw away.' Learn the pattern."

They took a small AI model (only 1.2 billion "brain cells," which is tiny for AI standards) and taught it specifically how you (the human researcher) make decisions.
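The "learn the pattern from sorted examples" idea can be sketched in miniature. The paper fine-tunes a 1.2-billion-parameter LLM; as a toy stand-in for that, the sketch below trains a tiny word-counting classifier from labeled "keep"/"exclude" examples. The training texts and labels are invented for illustration and are not from the paper — only the supervised idea (show the machine your past decisions, let it learn the pattern) carries over.

```python
from collections import Counter

# Toy labeled examples (invented for illustration; the paper used
# real human-rated titles/abstracts and a 1.2B-parameter LLM).
TRAIN = [
    ("large language models for education research", "keep"),
    ("fine-tuning transformers for text classification", "keep"),
    ("AI chatbots in computer science classrooms", "keep"),
    ("sourdough bread fermentation techniques", "exclude"),
    ("wheat flour supply chains in europe", "exclude"),
    ("pastry oven temperature control", "exclude"),
]

def tokenize(text):
    return text.lower().split()

def train(examples):
    """Count word frequencies per label -- a tiny frequency-based model."""
    counts = {"keep": Counter(), "exclude": Counter()}
    for text, label in examples:
        counts[label].update(tokenize(text))
    return counts

def predict(model, text):
    """Score each label by how often its training words appear in the text."""
    scores = {label: sum(words[tok] for tok in tokenize(text))
              for label, words in model.items()}
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict(model, "transformers for classifying education abstracts"))  # keep
print(predict(model, "bread oven techniques"))                             # exclude
```

A real run swaps the word counter for gradient updates on the LLM's weights, but the shape of the workflow is the same: labeled past decisions in, a specialized screener out.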

The Results: From Clueless to Champion

Here is what happened when they tested the two approaches:

  1. The "Generic" Robot (Before Training):
    It was a disaster. It agreed with the human librarian only 6.5% of the time. It was basically throwing darts blindfolded. It was so bad that its agreement score was actually negative (meaning it was doing worse than random chance!).

  2. The "Trained" Robot (After Fine-Tuning):
    After the short training session (which took only 2 minutes on a single computer chip!), the robot became a superhero.

    • Agreement: It now agreed with the human librarian 86.4% of the time.
    • Safety Net: Most importantly, it caught 91% of the books that should have been kept. In this job, it's better to accidentally keep a book you don't need (which a human can throw away later) than to accidentally throw away a book you do need.
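The two headline numbers above come from comparing the model's labels against the human's. Raw agreement is the share of identical calls; the "agreement score" that can go negative is Cohen's kappa, which subtracts out chance agreement; the safety net is recall, the share of true "keeps" the model caught. The sketch below computes all three from a 2x2 confusion matrix — the counts are hypothetical, chosen only to make the arithmetic visible, not the paper's data.

```python
def screening_metrics(tp, fp, fn, tn):
    """Raw agreement, Cohen's kappa, and recall from a 2x2 confusion matrix.

    tp = both say "keep", tn = both say "exclude",
    fp = model keeps but human excludes, fn = model excludes but human keeps.
    """
    n = tp + fp + fn + tn
    agreement = (tp + tn) / n                        # raw percent agreement
    # Chance agreement: probability both raters happen to pick the same
    # label at random, given each rater's own label frequencies.
    p_keep = ((tp + fn) / n) * ((tp + fp) / n)
    p_excl = ((tn + fp) / n) * ((tn + fn) / n)
    p_chance = p_keep + p_excl
    kappa = (agreement - p_chance) / (1 - p_chance)  # < 0 means worse than chance
    recall = tp / (tp + fn)                          # share of true "keeps" caught
    return agreement, kappa, recall

# Hypothetical counts (illustrative only, not from the paper):
agreement, kappa, recall = screening_metrics(tp=91, fp=10, fn=9, tn=890)
print(f"agreement={agreement:.1%}  kappa={kappa:.2f}  recall={recall:.1%}")
# → agreement=98.1%  kappa=0.89  recall=91.0%
```

Note how kappa (0.89) is lower than raw agreement (98.1%): when almost everything is an "exclude", two raters agree a lot by luck alone, and kappa corrects for exactly that — which is also how a model can score *negative* despite agreeing some of the time.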

The "Second Pair of Eyes" Strategy

The paper suggests a clever workflow to save time and money:

  1. Human First: A human reads the titles and makes the first cut.
  2. AI Second: The trained robot reads the same titles and acts as a second pair of eyes.
  3. The Reconciliation: If the human says "Keep" and the robot says "Throw," or vice versa, a human checks that specific disagreement.

This is like having a co-pilot for your plane. You (the human) are still flying, but the robot is watching the instruments. If the robot spots something you missed, it alerts you. This means you don't need to hire two humans to do the same job (which is the current expensive standard); you can hire one human and one cheap, fast robot.
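The three-step workflow is easy to mechanize: everything the two screeners agree on is accepted, and only the disagreements go back to a human for a tie-breaking read. A minimal sketch (the function and the sample labels are mine, not from the paper):

```python
def reconcile(human_labels, model_labels):
    """Return the record indices where human and model disagree.

    Agreed-upon records are accepted as-is; only the returned
    indices need a second human look.
    """
    assert len(human_labels) == len(model_labels)
    return [i for i, (h, m) in enumerate(zip(human_labels, model_labels))
            if h != m]

human = ["keep", "exclude", "keep", "exclude", "exclude"]
model = ["keep", "keep",    "keep", "exclude", "keep"]
print(reconcile(human, model))  # → [1, 4]
```

Out of five records, only two go back for review — that gap between "re-read everything twice" and "re-read only the disputes" is where the time and cost savings come from.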

Why This Matters

  • Speed: What used to take months could take weeks.
  • Cost: You save money by not needing a second human reviewer for every single paper.
  • Reliability: The robot was consistent. If you asked it the same question three times with slightly different settings, it gave the exact same answer every time.

The Catch (Limitations)

The authors are honest about the limitations:

  • The "Training" Takes Time: You have to spend time gathering those 371 examples and training the robot. It's not a magic button you press instantly for a new topic.
  • Specific to the Job: This robot is an expert on this specific reading club. If you start a new review about "Baking Bread," you'd have to retrain the robot with new examples. It can't just magically know everything about baking bread without learning first.

The Bottom Line

This paper proves that if you take a small, cheap AI and teach it specifically how you think, it becomes a powerful tool. It won't replace the human librarian entirely, but it can be the ultimate assistant, handling the heavy lifting so humans can focus on the final, most important decisions.

In short: Don't just ask the AI to do the work; give it a crash course in your specific style, and watch it become your best employee.