From Raw Data to Reliable Predictions: The Significance of Data Processing in COVID-19 Modelling

This study demonstrates that a custom data preprocessing pipeline featuring daily data transformation, localized outlier detection, consistency checks, and iterative feature selection significantly enhances the predictive accuracy of machine learning models for COVID-19 mortality compared to standard methods.

Sangita Das, Subhrajyoti Maji

Published 2026-02-27

Imagine you are trying to predict how many people will get sick or pass away during a pandemic. To do this, you build a "crystal ball" using math and computers (Machine Learning). But here's the catch: a crystal ball is only as good as the glass it's made of. If the glass is dirty, cracked, or warped, your prediction will be wrong, no matter how smart your math is.

This paper is about cleaning that glass. The authors, Sangita Das and Subhrajyoti Maji, argue that most scientists spend too much time building fancy crystal balls (complex models) and not enough time cleaning the glass (preparing the data).

Here is the story of how they fixed the data to get a much clearer picture of the future.

1. The Problem: The "Weekly Report" Glitch

Imagine you are tracking a runner's speed. But instead of recording their speed every second, you only write down their total distance once a week, on Sunday. Then, you try to guess their speed for Monday, Tuesday, and Wednesday.

  • The Mistake: You might assume they ran zero miles on Monday through Saturday and did all the running on Sunday. That's obviously wrong!
  • The Reality: The COVID-19 data had this exact problem. Hospitals often reported new deaths only once a week. The data showed "0 deaths" for six days and a huge spike on the seventh.
  • The Fix: The authors created a "Daily Distributor." Instead of letting the data sit at zero for six days, they took the total weekly number and spread it out evenly across all seven days. This smoothed out the fake spikes and gave the computer a realistic view of what was happening every single day.
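A tiny sketch of that "Daily Distributor" idea in Python. The function name and the zero-run logic here are illustrative, not the authors' actual code; it assumes a reporting gap shows up as a run of zeros ending in one big weekly dump:

```python
def distribute_weekly(series):
    """Spread each lump-sum report evenly across the zero-filled
    days that precede it (a sketch, not the paper's exact code).

    `series` is a list of daily counts where a reporting gap appears
    as consecutive zeros followed by one large figure.
    """
    out = list(series)
    start = 0  # index where the current run of zeros began
    for i, value in enumerate(series):
        if value == 0:
            continue
        span = i - start + 1      # days covered by this report
        share = value / span      # even split across the gap
        for j in range(start, i + 1):
            out[j] = share
        start = i + 1
    return out
```

Six reported zeros followed by 70 deaths on Sunday become a steady 10 per day, so the model sees a realistic daily rate instead of a fake spike.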

2. The Problem: The "Global" vs. "Local" Outlier

Imagine you are looking at a crowd of people.

  • The Standard Approach (Global): You set a rule: "Anyone taller than 7 feet is an error." If you see a 7-foot-2-inch basketball player, you might delete them because they don't fit the "average" height. But in a pandemic, a sudden spike in cases might look like a "giant" compared to the average, but it's actually real!
  • The Custom Approach (Local): The authors used a "Rolling Window" (like looking through a magnifying glass that moves with you). They only looked at the people right next to the person in question. If a sudden spike happened, they checked if it was weird compared to the immediate past few days, not compared to the whole year. This allowed them to keep real, important spikes while only removing actual errors.
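A minimal sketch of the rolling-window idea, comparing each point only to its recent neighbours using the median and median absolute deviation. The window size, threshold, and function name are assumptions for illustration, not the paper's exact procedure:

```python
import statistics

def local_outliers(values, window=7, threshold=5.0):
    """Flag points that deviate sharply from their *recent* past.

    Each value is compared to the median and spread of the preceding
    `window` observations, so a genuine wave (which its neighbours
    share) tends to be kept, while an isolated glitch is flagged.
    """
    flags = [False] * len(values)
    for i in range(window, len(values)):
        recent = values[i - window:i]
        med = statistics.median(recent)
        # median absolute deviation; fall back to 1.0 if the window is flat
        spread = statistics.median(abs(v - med) for v in recent) or 1.0
        if abs(values[i] - med) / spread > threshold:
            flags[i] = True
    return flags
```

A lone jump from ~11 to 500 gets flagged, while the steady values around it pass untouched; a global "7-foot rule" would have needed one fixed cutoff for the entire timeline.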

3. The Problem: The "Broken Calculator"

Sometimes, data columns contradict each other.

  • The Scenario: Imagine yesterday's "Total Vaccinations" was 70, today's total reads 100, but the "New Vaccinations" column for today says 50. The columns contradict each other: 70 + 50 ≠ 100.
  • The Fix: The authors built a "Logic Engine." Instead of just guessing missing numbers, they used the relationships between the columns. If they knew the "Total," they calculated the "New" by subtracting yesterday's total. If they knew the "New," they rebuilt the "Total" by adding it up. This kept the columns internally consistent, like a well-oiled machine.
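The "Logic Engine" idea can be sketched with the identity total[t] = total[t-1] + new[t]. The function below is a minimal stand-in (with `None` marking a missing value), not the paper's implementation:

```python
def reconcile(totals, news):
    """Fill gaps in a cumulative/daily column pair using their
    identity: total[t] = total[t-1] + new[t].

    `None` marks a missing entry. A sketch of the consistency-check
    idea, not the authors' actual code.
    """
    totals, news = list(totals), list(news)
    for t in range(1, len(totals)):
        if totals[t] is None and totals[t - 1] is not None and news[t] is not None:
            totals[t] = totals[t - 1] + news[t]   # rebuild the total
        if news[t] is None and totals[t] is not None and totals[t - 1] is not None:
            news[t] = totals[t] - totals[t - 1]   # rebuild the daily count
    return totals, news
```

Given totals of [100, None, 180] and daily counts of [None, 30, None], the gaps resolve to a total of 130 on day two and 50 new on day three, and the two columns now add up.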

4. The Problem: Too Many Ingredients

Imagine you are baking a cake. You have 60 ingredients: flour, sugar, eggs, but also "socks," "toothpicks," and "old newspapers."

  • The Mistake: Throwing everything into the bowl confuses the baker (the computer). It wastes time and makes the cake taste bad.
  • The Fix: The authors used a "Smart Filter." They tested every ingredient to see if it actually helped the cake rise. They kept only the top 5 ingredients that mattered most (like "New Cases" and "Stringency Index") and threw away the rest. This made the model faster and more accurate.
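One simple stand-in for that "Smart Filter": rank every candidate column by the strength of its correlation with the target and keep the top k. The paper's iterative selection procedure is more involved; this sketch just shows the idea, and the column names in the example are illustrative:

```python
def top_k_features(features, target, k=5):
    """Keep the k columns most correlated (in absolute value) with
    the target. A simple stand-in for the paper's iterative
    feature selection, using Pearson correlation.

    `features` maps column names to equal-length lists of numbers.
    """
    def corr(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    ranked = sorted(features, key=lambda name: -abs(corr(features[name], target)))
    return ranked[:k]
```

A column that moves in lockstep with mortality (like "New Cases" in the paper's top five) ranks first; a "socks"-style column with no relationship falls off the list.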

The Result: A Stunning Improvement

To see if their "cleaning crew" worked, they tested 10 different computer models.

  • The Standard Way (Dirty Glass): The best model they could get was like a blurry photo. It had a score (R²) of 0.817. It was okay, but it missed a lot of details.
  • The Custom Way (Clean Glass): With their special cleaning steps, the same type of model became a high-definition 4K photo. It achieved a score of 0.991.

What does that mean?
R² measures how much of the variation in the data the model explains. The standard model left about 18% of that variation unexplained (1 − 0.817). The custom model left less than 1% unexplained (1 − 0.991). It was almost perfect.
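For the curious, R² is a standard formula, not something exotic; here is how it is computed (this is the textbook definition, not code from the paper):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (sum of squared errors /
    total variation around the mean). 1.0 means perfect predictions;
    0.0 means no better than always guessing the average.
    """
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst
```

Perfect predictions score 1.0, while a model that just predicts the average every time scores 0.0, which is why the jump from 0.817 to 0.991 is such a large step toward the ceiling.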

The Big Takeaway

The main lesson of this paper isn't just about COVID-19. It's a lesson for anyone trying to predict the future, whether it's stock markets, weather, or sports scores.

You can have the most expensive, powerful computer in the world, but if you feed it messy, biased, or inconsistent data, it will give you garbage answers.

By taking the time to:

  1. Spread out the data (fixing the weekly reporting bias),
  2. Look locally (finding real spikes, not just global errors),
  3. Check the math (ensuring columns agree with each other), and
  4. Pick the best ingredients (selecting the right features),

...you can turn a mediocre prediction into a highly accurate one. The authors proved that how you prepare the data is just as important as the model you build.


Note: This preprint has been peer-reviewed and published as: 'From Raw Data to Reliable Predictions: The Significance of Data Processing in COVID-19 Modelling,' Asian Journal of Research in Computer Science, 19(2), 75-96 (2026). DOI: https://doi.org/10.9734/ajrcos/2026/v19i2826. Code: https://github.com/dassangita844/Preprocessing_COVID-19_Dataset_India
