📄 public and global health

Cumulative In-Context Learning versus Simple Historical Weighting for Real-Time Geographic Origin Identification of Ongoing Epidemic Waves: A Comparative Evaluation Using Eight COVID-19 Waves in Japan

This study shows that a transparent, spreadsheet-implementable statistical method based on cumulative historical weighting performs comparably to a large language model at identifying the geographic origins of Japan's COVID-19 waves. The performance gain stems from accumulated historical data rather than from the AI's reasoning capabilities, although the model still shows substantial intrinsic geographic reasoning without such context.

Original authors: Nakagawa, S., Yamamoto, A.

Published 2026-05-25




Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Question: Where Did the Virus Start?

Imagine a new wave of a virus (like a ripple in a pond) starts spreading across Japan. Public health officials want to know exactly where that ripple began as quickly as possible. If they know the starting point, they can send help, test people, and stop the spread before it hits the whole country.

Usually, scientists have to wait weeks for lab tests (genomic sequencing) to confirm the origin. But by then, the virus has often already spread everywhere. This study asked: Can we predict the starting point faster using just the daily numbers of sick people, without waiting for the lab?

The Three Competitors

The researchers set up a race between three different "detectives" to see who could find the origin of 8 different virus waves in Japan the fastest (within 7, 14, 21, or 28 days).

The "Fresh Eyes" Statisticians (Traditional Methods):
These are standard math formulas. They look only at the current wave. They ask: "Which region has the highest number of cases right now?" or "Which region started getting sick first?" They treat every new wave as if it's the first time the virus has ever existed. They have no memory of the past.
The "Super-Brain" AI (Large Language Model):
This is a powerful AI (Claude Haiku). It was given the current numbers plus a history book of all the earlier waves. It was told: "Look at the current data, but remember that in the past, waves often started in these specific places." It uses its "in-context learning" to guess the origin.
The "Smart Spreadsheet" (Cumulative Calculation):
This is the paper's secret weapon. It's a simple math formula that works exactly like the "Fresh Eyes" formulas, but it adds a "bonus point" to regions that have been the starting point of waves in the past.
- Analogy: Imagine a sports team. The "Fresh Eyes" coach only looks at today's practice. The "Smart Spreadsheet" coach looks at today's practice plus a note that says, "This player has scored the winning goal in 5 out of the last 7 games." It's a simple arithmetic trick, not a complex AI.

The Race Results

The researchers measured success using an "F1 score" (a grade from 0 to 1, where 1 is perfect). The grades below are from the 14-day checkpoint.

The "Fresh Eyes" Statisticians: They were okay, getting a grade of about 0.41 to 0.46. They missed a lot because they forgot the lessons of the past.
The "Super-Brain" AI: When it used its history book, it got a grade of 0.52. It did better than the fresh statisticians.
The "Smart Spreadsheet": Surprisingly, this simple math method got a grade of 0.51.

The Big Surprise: The simple spreadsheet performed almost exactly the same as the fancy AI. The paper concludes that the AI didn't win because it is "smarter" or has better reasoning; it won because it was reminded of history. The simple spreadsheet did the exact same thing by just adding a "history bonus" to the math.

The "Magic" of the AI (Without the History)

The researchers also tested the AI without giving it any history (just the current numbers).

Result: The AI still got a 0.46.
What this means: The AI has some "natural" ability to guess geography based on its training, even without being told the history. Giving it the history lifts its grade only from 0.46 to 0.52, and the spreadsheet reaches 0.51 with nothing more than its history bonus. The "history" is the real magic, not the AI itself.

The One Time Everyone Failed (Wave 6)

There was one specific wave (Omicron BA.1) where everyone failed (Grade 0.00).

Why? The virus started in a way that the daily numbers didn't catch. It was like a thief entering a house through a secret tunnel that the security cameras couldn't see. Because the data was missing, the math, the spreadsheet, and the AI all had nothing to work with. This shows that if the data is bad or missing, no amount of clever computing can fix it.

The Final Takeaway

The AI isn't a miracle worker: For this specific job, a fancy AI isn't necessary.
History is key: The most important thing for predicting where a virus starts is remembering where it started before.
Keep it simple: You don't need expensive servers or complex AI to do this. You can do it with a spreadsheet (like Excel) by simply adding a "history bonus" to the regions that have been trouble spots before.

In short: To find where a virus wave starts, don't just look at today's numbers. Look at the past. And you don't need a robot to do that; a simple calculator with a memory works just as well.

Technical Summary: Cumulative In-Context Learning vs. Simple Historical Weighting for Epidemic Origin Identification

Problem Statement
Early identification of the geographic origin of epidemic waves is critical for targeted public health interventions, such as contact tracing and travel advisories. However, conventional statistical methods for origin estimation (e.g., cross-correlation, Granger causality, early growth rates) typically treat each epidemic wave as an independent event. This approach fails to leverage accumulated epidemiological knowledge regarding which regions historically serve as introduction points. While Large Language Models (LLMs) offer a potential mechanism for "cumulative learning" by incorporating historical context into predictions, it remains unknown whether LLMs outperform conventional statistical baselines in early detection, or whether the specific advantage of cumulative learning can be replicated using transparent, interpretable statistical methods.

Methodology
The study evaluated three computational approaches across eight COVID-19 epidemic waves in Japan (Waves 2–8, 2020–2023), using prefecture-level case count data aggregated into 11 regional blocks. Predictions were made at 7, 14, 21, and 28 days after wave onset and validated against genomically confirmed origins.
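As a rough illustration of what scoring a region inside an observation window could look like, the sketch below computes four of the windowed quantities named in this section (early onset day, peak rate, OLS growth rate, cumulative rate) for a single region's daily incidence series. The function names, the threshold value, and the choice of fitting the slope to log incidence are assumptions made for illustration. The paper's exact definitions and its normalization step for the growth rate are not reproduced here.

```python
import math
from typing import List, Optional


def onset_day(rates: List[float], threshold: float) -> Optional[int]:
    """B0-style: first day (0-based) on which incidence exceeds the threshold."""
    for day, rate in enumerate(rates):
        if rate > threshold:
            return day
    return None  # threshold never exceeded in the series


def peak_rate(rates: List[float], window: int) -> float:
    """B1-style: maximum incidence within the first `window` days."""
    return max(rates[:window])


def cumulative_rate(rates: List[float], window: int) -> float:
    """B3-style: total incidence within the first `window` days."""
    return sum(rates[:window])


def log_growth_slope(rates: List[float], window: int) -> float:
    """B2-style: OLS slope of log(incidence) against day index.

    Days with zero incidence are skipped (assumption); no normalization
    across regions is applied in this sketch.
    """
    points = [(d, math.log(r)) for d, r in enumerate(rates[:window]) if r > 0]
    if len(points) < 2:
        return 0.0
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical series that doubles every day, observed for 7 days.
series = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
print(onset_day(series, threshold=3.0))
print(peak_rate(series, window=7))
print(cumulative_rate(series, window=7))
print(log_growth_slope(series, window=7))
```

For a series that doubles daily, the fitted slope equals ln 2, which gives a quick sanity check on the regression arithmetic.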

Non-Cumulative Statistical Baselines (B0–B5): Six methods treated each wave independently without historical context:
- B0: Early Onset Day (time to exceed incidence threshold).
- B1: Peak Infection Rate (maximum incidence in the observation window).
- B2: OLS Growth Rate (normalized exponential growth slope).
- B3: Cumulative Infection Rate (total cases in the observation window).
- B4: Cross-correlation Lead Score (temporal precedence of regional time series).
- B5: Granger Causality Score (predictive priority of one region over others).
- Note: For all methods, the top-3 ranked regions were designated as predicted origins.
Cumulative-Learning LLM: A general-purpose LLM (Claude Haiku) was used without fine-tuning. It received structured prompts containing current-wave data (incidence rates, onset days) and cumulative historical context (confirmed genomic origins, highest/lowest rates, and variants from all prior waves). The model was tasked with identifying the top-3 origin regions based on this combined context. A non-cumulative LLM condition (current data only) was also tested to isolate intrinsic reasoning capabilities.
Cumulative Calculation Statistical Baselines: To test if the LLM's advantage was due to "reasoning" or simply "historical weighting," the authors implemented transparent arithmetic versions of the best-performing baselines (B1 and B3). These methods added a weighted historical frequency term ($P(r,n)$) to the current-wave score:
$\mathrm{Score}_{\mathrm{cumul}}(r) = \mathrm{Score}_{\mathrm{baseline}}(r) + \lambda \times P(r,n)$
where $P(r,n)$ is the proportion of prior waves where region $r$ was a confirmed origin, and $\lambda$ was set to 0.75 based on sensitivity analysis.
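The scoring rule above is simple enough to sketch in a few lines. The following is a minimal illustration, not the authors' Box 1 algorithm: the min-max scaling of the baseline score to [0, 1] is an assumption (the summary does not say how the baseline is scaled before the history term is added), and the region labels, rates, and origin history are invented placeholders.

```python
from typing import Dict, List, Set


def min_max_normalize(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale baseline scores to [0, 1] (assumed step, not stated in the source)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {region: 0.0 for region in raw}
    return {region: (value - lo) / (hi - lo) for region, value in raw.items()}


def historical_proportion(region: str, prior_origins: List[Set[str]]) -> float:
    """P(r, n): share of prior waves in which `region` was a confirmed origin."""
    if not prior_origins:
        return 0.0
    hits = sum(1 for origins in prior_origins if region in origins)
    return hits / len(prior_origins)


def cumulative_scores(
    baseline: Dict[str, float],
    prior_origins: List[Set[str]],
    lam: float = 0.75,
) -> Dict[str, float]:
    """Score_cumul(r) = Score_baseline(r) + lambda * P(r, n)."""
    return {
        region: score + lam * historical_proportion(region, prior_origins)
        for region, score in baseline.items()
    }


def top_k(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Highest-scoring regions; ties are broken alphabetically."""
    return sorted(scores, key=lambda region: (-scores[region], region))[:k]


# Hypothetical peak incidence rates for four placeholder regions.
raw_peak = {"A": 50.0, "B": 45.0, "C": 30.0, "D": 25.0}
# Hypothetical confirmed origins of four earlier waves.
history = [{"D"}, {"C", "D"}, {"A"}, {"D"}]

baseline = min_max_normalize(raw_peak)
print(top_k(baseline))                              # ['A', 'B', 'C']
print(top_k(cumulative_scores(baseline, history)))  # ['A', 'B', 'D']
```

In this toy case region D has the lowest current-wave rate but was an origin in three of four earlier waves, so the history term (0.75 × 0.75 = 0.5625) moves it into the top three ahead of C.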

Key Contributions

Comparative Evaluation: The study provides the first systematic comparison of general-purpose LLMs against established statistical baselines for the specific task of geographic epidemic origin identification using routine surveillance data.
Decoupling Mechanism: It isolates the "cumulative learning" mechanism from the "LLM reasoning" mechanism, demonstrating that the performance gain comes from the weighting of historical data rather than the neural network's intrinsic reasoning.
Transparent Implementation: The authors provide a four-step, spreadsheet-implementable algorithm (Box 1) that replicates LLM-level accuracy without requiring AI infrastructure, proprietary APIs, or black-box models.

Results

Performance at 14 Days: Cumulative calculation statistical baselines (B1_cumul, B3_cumul) achieved a mean F1 score of 0.51, performing comparably to the cumulative-learning LLM (0.52) and significantly outperforming all non-cumulative statistical baselines (F1 range: 0.41–0.46).
LLM Intrinsic Capacity: The non-cumulative LLM (no historical context) achieved an F1 of 0.46, matching the best non-cumulative statistical baselines (B1, B3) and outperforming others. Notably, the non-cumulative LLM detected Wave 6 (Omicron BA.1) with an F1 of 0.40, whereas all statistical methods failed (F1 = 0.00).
Wave-Specific Outcomes:
- Wave 7 (Omicron BA.5): Correctly identified at 14 days by both cumulative methods and the LLM (F1 = 1.00).
- Wave 6 (Omicron BA.1): Undetected by all methods (F1 = 0.00). The authors attribute this to the wave's origins (Okinawa and Chugoku) being linked to early cluster events that predated entry into the routine domestic surveillance system, meaning the input data lacked the necessary signal.
Feature Engineering: The study notes that the LLM did not process raw data but rather human-designed epidemiological summaries. The performance may reflect the quality of this feature engineering as much as the model's reasoning.
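To make the F1 values above easier to interpret, here is a small sketch of a set-based F1 score that compares a top-3 prediction list with the set of confirmed origin regions. Treating F1 this way is an assumption about the evaluation, not a detail confirmed by this summary. The predicted list in the example is hypothetical; only Okinawa and Chugoku are taken from the text.

```python
from typing import Iterable


def set_f1(predicted: Iterable[str], confirmed: Iterable[str]) -> float:
    """F1 between a set of predicted origin regions and the confirmed set."""
    pred, truth = set(predicted), set(confirmed)
    hits = len(pred & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)   # share of predictions that were correct
    recall = hits / len(truth)     # share of true origins that were found
    return 2 * precision * recall / (precision + recall)


# Wave 6 origins as stated in the text; the top-3 list is a hypothetical
# prediction that contains exactly one of them.
confirmed_wave6 = {"Okinawa", "Chugoku"}
hypothetical_top3 = ["Okinawa", "Kanto", "Kinki"]
print(round(set_f1(hypothetical_top3, confirmed_wave6), 2))  # 0.4
```

Under this definition, one correct region out of three predictions against two confirmed origins gives precision 1/3 and recall 1/2, so F1 = 0.40. A score of 1.00 would require every predicted region to be a confirmed origin and every confirmed origin to be predicted.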

Significance and Claims
The paper claims that the cumulative historical weighting mechanism, rather than the LLM's specific reasoning capabilities, is the primary driver of performance improvement in early epidemic origin identification. The convergence of the transparent statistical method (F1 = 0.51) and the LLM (F1 = 0.52) suggests that for structured spatial reasoning tasks in epidemiology, simple arithmetic implementations of historical priors are sufficient and preferable due to their transparency, auditability, and lack of dependency on AI infrastructure.

The authors position this approach not as a replacement for genomic surveillance, but as a deployable, hypothesis-generating complement that can provide probabilistic origin estimates in real-time (within 14 days of onset) using only routinely available case data. The study emphasizes that while LLMs show substantial intrinsic geographic reasoning capacity (evidenced by the non-cumulative LLM's performance), their marginal advantage over transparent statistical methods in this specific context does not yet justify the complexity and cost of AI deployment in routine public health practice. The systematic failure in Wave 6 serves as a critical reminder that no analytical method can compensate for absent surveillance signals.