From Raw Data to Reliable Predictions: The Significance of Data Processing in COVID-19 Modelling

This study demonstrates that a custom data preprocessing pipeline featuring daily data transformation, localized outlier detection, consistency checks, and iterative feature selection significantly enhances the predictive accuracy of machine learning models for COVID-19 mortality compared to standard methods.

Sangita Das, Subhrajyoti Maji

Published 2026-02-27

Imagine you are trying to predict how many people will get sick or pass away during a pandemic. To do this, you build a "crystal ball" using math and computers (Machine Learning). But here's the catch: a crystal ball is only as good as the glass it's made of. If the glass is dirty, cracked, or warped, your prediction will be wrong, no matter how smart your math is.

This paper is about cleaning that glass. The authors, Sangita Das and Subhrajyoti Maji, argue that most scientists spend too much time building fancy crystal balls (complex models) and not enough time cleaning the glass (preparing the data).

Here is the story of how they fixed the data to get a much clearer picture of the future.

1. The Problem: The "Weekly Report" Glitch

Imagine you are tracking a runner's speed. But instead of recording their speed every second, you only write down their total distance once a week, on Sunday. Then, you try to guess their speed for Monday, Tuesday, and Wednesday.

  • The Mistake: You might assume they ran zero miles on Monday through Saturday and did all the running on Sunday. That's obviously wrong!
  • The Reality: The COVID-19 data had this exact problem. Hospitals often reported new deaths only once a week. The data showed "0 deaths" for six days and a huge spike on the seventh.
  • The Fix: The authors created a "Daily Distributor." Instead of letting the data sit at zero for six days, they took the total weekly number and spread it out evenly across all seven days. This smoothed out the fake spikes and gave the computer a realistic view of what was happening every single day.
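A tiny sketch of that "Daily Distributor" idea in Python. The function name and the zero-run logic here are illustrative, not the authors' actual code; it assumes a reporting gap shows up as a run of zeros ending in one big weekly dump:

```python
def distribute_weekly(series):
    """Spread each lump-sum report evenly across the zero-filled
    days that precede it (a sketch, not the paper's exact code).

    `series` is a list of daily counts where a reporting gap appears
    as consecutive zeros followed by one large figure.
    """
    out = list(series)
    start = 0  # index where the current run of zeros began
    for i, value in enumerate(series):
        if value == 0:
            continue
        span = i - start + 1      # days covered by this report
        share = value / span      # even split across the gap
        for j in range(start, i + 1):
            out[j] = share
        start = i + 1
    return out
```

Six reported zeros followed by 70 deaths on Sunday become a steady 10 per day, so the model sees a realistic daily rate instead of a fake spike.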

2. The Problem: The "Global" vs. "Local" Outlier

Imagine you are looking at a crowd of people.

  • The Standard Approach (Global): You set a rule: "Anyone taller than 7 feet is an error." If you see a 7-foot-2-inch basketball player, you might delete them because they don't fit the "average" height. But in a pandemic, a sudden spike in cases might look like a "giant" compared to the average, but it's actually real!
  • The Custom Approach (Local): The authors used a "Rolling Window" (like looking through a magnifying glass that moves with you). They only looked at the people right next to the person in question. If a sudden spike happened, they checked if it was weird compared to the immediate past few days, not compared to the whole year. This allowed them to keep real, important spikes while only removing actual errors.
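A minimal sketch of the rolling-window idea, comparing each point only to its recent neighbours using the median and median absolute deviation. The window size, threshold, and function name are assumptions for illustration, not the paper's exact procedure:

```python
import statistics

def local_outliers(values, window=7, threshold=5.0):
    """Flag points that deviate sharply from their *recent* past.

    Each value is compared to the median and spread of the preceding
    `window` observations, so a genuine wave (which its neighbours
    share) tends to be kept, while an isolated glitch is flagged.
    """
    flags = [False] * len(values)
    for i in range(window, len(values)):
        recent = values[i - window:i]
        med = statistics.median(recent)
        # median absolute deviation; fall back to 1.0 if the window is flat
        spread = statistics.median(abs(v - med) for v in recent) or 1.0
        if abs(values[i] - med) / spread > threshold:
            flags[i] = True
    return flags
```

A lone jump from ~11 to 500 gets flagged, while the steady values around it pass untouched; a global "7-foot rule" would have needed one fixed cutoff for the entire timeline.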

3. The Problem: The "Broken Calculator"

Sometimes, data columns contradict each other.

  • The Scenario: Imagine yesterday's "Total Vaccinations" was 70, today's total reads 100, but the "New Vaccinations" column for today says 50. The columns contradict each other: 70 + 50 ≠ 100.
  • The Fix: The authors built a "Logic Engine." Instead of just guessing missing numbers, they used the relationships between the columns. If they knew the "Total," they calculated the "New" by subtracting yesterday's total. If they knew the "New," they rebuilt the "Total" by adding it up. This kept the columns internally consistent, like a well-oiled machine.
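The "Logic Engine" idea can be sketched with the identity total[t] = total[t-1] + new[t]. The function below is a minimal stand-in (with `None` marking a missing value), not the paper's implementation:

```python
def reconcile(totals, news):
    """Fill gaps in a cumulative/daily column pair using their
    identity: total[t] = total[t-1] + new[t].

    `None` marks a missing entry. A sketch of the consistency-check
    idea, not the authors' actual code.
    """
    totals, news = list(totals), list(news)
    for t in range(1, len(totals)):
        if totals[t] is None and totals[t - 1] is not None and news[t] is not None:
            totals[t] = totals[t - 1] + news[t]   # rebuild the total
        if news[t] is None and totals[t] is not None and totals[t - 1] is not None:
            news[t] = totals[t] - totals[t - 1]   # rebuild the daily count
    return totals, news
```

Given totals of [100, None, 180] and daily counts of [None, 30, None], the gaps resolve to a total of 130 on day two and 50 new on day three, and the two columns now add up.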

4. The Problem: Too Many Ingredients

Imagine you are baking a cake. You have 60 ingredients: flour, sugar, eggs, but also "socks," "toothpicks," and "old newspapers."

  • The Mistake: Throwing everything into the bowl confuses the baker (the computer). It wastes time and makes the cake taste bad.
  • The Fix: The authors used a "Smart Filter." They tested every ingredient to see if it actually helped the cake rise. They kept only the top 5 ingredients that mattered most (like "New Cases" and "Stringency Index") and threw away the rest. This made the model faster and more accurate.
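One simple stand-in for that "Smart Filter": rank every candidate column by the strength of its correlation with the target and keep the top k. The paper's iterative selection procedure is more involved; this sketch just shows the idea, and the column names in the example are illustrative:

```python
def top_k_features(features, target, k=5):
    """Keep the k columns most correlated (in absolute value) with
    the target. A simple stand-in for the paper's iterative
    feature selection, using Pearson correlation.

    `features` maps column names to equal-length lists of numbers.
    """
    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    ranked = sorted(features, key=lambda name: -abs(corr(features[name], target)))
    return ranked[:k]
```

A column that moves in lockstep with mortality (like "New Cases" in the paper's top five) ranks first; a "socks"-style column with no relationship falls off the list.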

The Result: A Stunning Improvement

To see if their "cleaning crew" worked, they tested 10 different computer models.

  • The Standard Way (Dirty Glass): The best model they could get was like a blurry photo. It had a score (R²) of 0.817. It was okay, but it missed a lot of details.
  • The Custom Way (Clean Glass): With their special cleaning steps, the same type of model became a high-definition 4K photo. It achieved a score of 0.991.

What does that mean?
R² measures how much of the variation in the data the model explains. The standard model left about 18% of that variation unexplained (1 − 0.817). The custom model left less than 1% unexplained (1 − 0.991). It was almost perfect.
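For the curious, R² is a standard formula, not something exotic; here is how it is computed (this is the textbook definition, not code from the paper):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (sum of squared errors /
    total variation around the mean). 1.0 means perfect predictions;
    0.0 means no better than always guessing the average.
    """
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst
```

Perfect predictions score 1.0, while a model that just predicts the average every time scores 0.0, which is why the jump from 0.817 to 0.991 is such a large step toward the ceiling.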

The Big Takeaway

The main lesson of this paper isn't just about COVID-19. It's a lesson for anyone trying to predict the future, whether it's stock markets, weather, or sports scores.

You can have the most expensive, powerful computer in the world, but if you feed it messy, biased, or inconsistent data, it will give you garbage answers.

By taking the time to:

  1. Spread out the data (fixing the weekly reporting bias),
  2. Look locally (finding real spikes, not just global errors),
  3. Check the math (ensuring columns agree with each other), and
  4. Pick the best ingredients (selecting the right features),

...you can turn a mediocre prediction into a highly accurate one. The authors proved that how you prepare the data is just as important as the model you build.


Note: This preprint has been peer-reviewed and published as: 'From Raw Data to Reliable Predictions: The Significance of Data Processing in COVID-19 Modelling,' Asian Journal of Research in Computer Science, 19(2), 75-96 (2026). DOI: https://doi.org/10.9734/ajrcos/2026/v19i2826. Code: https://github.com/dassangita844/Preprocessing_COVID-19_Dataset_India
