Here is an explanation of the paper "Spatially Robust Inference with Predicted and Missing at Random Labels," translated into simple language with creative analogies.
The Big Picture: The "Map vs. Reality" Problem
Imagine you are a city planner trying to figure out the average income of every household in a massive, sprawling city.
- The Prediction (The Map): You have a super-smart AI that has looked at satellite photos, traffic patterns, and house sizes. It has generated a predicted income for every single house in the city. It's a complete map!
- The Reality (The Audit): You don't have the budget to ask every household their actual income. So, you send out a few surveyors to check a small sample of houses.
- The Problem:
- Missing Data: You only have real numbers for a tiny fraction of houses.
- Bias: Your surveyors didn't pick houses randomly. They mostly visited the wealthy downtown area because it was easier to get to. The AI's predictions for the poor suburbs might be wrong, but since you didn't check those areas, you don't know.
- The "Clumping" Issue: The houses you did visit are often neighbors. If one house is wealthy, its neighbor probably is too. This "clumping" makes it hard to tell if your estimate is accurate or just lucky.
The Goal: You want to combine the AI's full map with your small, biased sample to get a perfectly accurate average and, more importantly, a reliable confidence interval (a range that tells you how sure you are).
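To see why "clumping" matters, here is a toy simulation (not from the paper; all numbers are illustrative): two surveys of the same size, one where houses are independent and one made of 10-house clumps that share a neighborhood effect. The clumped survey's average bounces around far more, so a naive margin of error computed from it would be misleadingly narrow.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 2000, 100  # 2000 repeated "surveys" of 100 houses each

# Independent houses: each income is its own draw
means_iid = rng.normal(size=(reps, n)).mean(axis=1)

# Clumped houses: every block of 10 neighbors shares one neighborhood effect
block_effect = rng.normal(size=(reps, n // 10)).repeat(10, axis=1)
means_clump = (0.5 * rng.normal(size=(reps, n)) + block_effect).mean(axis=1)

# Same sample size, but the clumped survey average is several times noisier
print(means_iid.std(), means_clump.std())
```

With 10-house clumps, the survey effectively contains far fewer independent observations than it appears to.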
The Old Way: Why It Fails
Previous methods tried to fix this by assuming the surveyors picked houses completely at random (like drawing names from a hat). But in the real world, they didn't. They picked based on location and ease of access; the selection is "missing at random" only in the sense that it depends on things we can observe.
- The Flaw: When you try to fix the bias using the small sample, you use a technique called "Cross-Fitting." Imagine you split your surveyors into 5 teams. Team A learns from Teams B, C, D, and E, and then predicts for Team A.
- The Glitch: Because the surveyors in Team A all learned from the same other teams, they all share the same "training noise." It's like if five students in a study group all memorized the same wrong answer from their teacher. When they take the test together, they all get the same wrong answer.
- The Consequence: Standard statistical tools look at these students and think, "Wow, they are all getting the same answer because they live in the same neighborhood (spatial dependence)." They get scared and say, "The data is too messy! We need a huge safety margin!" This results in confidence intervals that are way too wide (useless) or too narrow (dangerously confident).
The New Solution: The "Jackknife-HAC" Fix
The authors propose a clever new method to untangle the "shared training noise" from the "real neighborhood patterns."
1. The Double-Robust Estimator (The "Two-Strap Backpack")
Think of your estimate as a backpack carrying two straps:
- Strap A: The AI's prediction (the map).
- Strap B: The correction based on the real survey data.
If the AI is perfect, Strap A holds the weight. If the AI is wrong but the survey data is good, Strap B holds the weight. You only fail if both are broken. This is called Double Robustness.
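A minimal numerical sketch of the two-strap idea (the numbers and model here are invented for illustration, not the paper's notation): the AI map is biased upward by a constant, and the survey over-samples wealthy areas, yet the correction strap pulls the estimate back to the truth as long as the sampling probabilities are modeled correctly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)                    # a covariate (e.g. house size)
y = 2.0 + 1.5 * x + rng.normal(size=n)    # true incomes (mostly unobserved)

# Strap A, the AI "map": a prediction that is systematically 0.5 too high
f_hat = 2.5 + 1.5 * x

# The survey: wealthier areas are more likely to be visited (not random!)
p = 1.0 / (1.0 + np.exp(-x))
labeled = rng.random(n) < p
p_hat = p                                 # assume propensities estimated well

# Strap B: inverse-propensity-weighted correction on the surveyed houses
theta_dr = np.mean(f_hat + labeled * (y - f_hat) / p_hat)

theta_naive = np.mean(f_hat)              # trusts the biased map alone
print(theta_naive, theta_dr)              # naive is ~0.5 off; DR is near 2.0
```

Swapping in a wrong prediction model but correct propensities (or vice versa) still recovers the true mean of 2.0; that is the double robustness.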
2. The Cross-Fitting Problem (The "Echo Chamber")
As mentioned, when you split the data into groups (folds) to train the AI, everyone in Group A hears the same "echo" from the training data. This echo looks like a real pattern, but it's just noise.
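The fold structure behind the echo is easy to sketch (a generic cross-fitting loop, with a simple least-squares line standing in for the AI; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 5
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Assign every observation to one of K folds ("teams")
fold = rng.integers(0, K, size=n)

# Cross-fitting: the model that predicts for fold k is trained only on
# the other K-1 folds, so no point is predicted by a model that saw it
f_hat = np.empty(n)
for k in range(K):
    train = fold != k
    slope, intercept = np.polyfit(x[train], y[train], 1)
    f_hat[fold == k] = intercept + slope * x[fold == k]

# Everyone in fold k gets predictions from the SAME fitted line, so the
# line's estimation error is a shared "echo" across the whole fold
```

That last comment is the crux: within a fold, prediction errors are correlated through the shared fit, which standard tools can mistake for spatial dependence.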
3. The Jackknife-HAC Correction (The "Noise Cancelling Headphones")
This is the paper's main invention. It works in three steps:
- Step 1: Center the Groups (The "Subtract the Echo").
Imagine you take the average score of Team A and subtract it from every member of Team A. You are removing the "shared echo" that everyone in that team heard. Now, the remaining differences are just the individual variations, not the group noise.
- Step 2: Measure the Real Patterns (The "HAC").
Now that you've removed the fake "group echo," you measure the real spatial dependence. Do neighbors actually have similar incomes? Yes? Okay, we account for that.
- Step 3: Add the Group Variation Back (The "ANOVA").
You can't ignore the fact that the teams were different. So, you add back the variation between the teams (the difference between Team A's average and Team B's average).
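The three steps can be sketched on synthetic data. This is a stylized illustration, not the paper's exact estimator: the bandwidth, the Bartlett kernel, and the ANOVA term below are simplified stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 5
coords = rng.uniform(0, 10, size=n)     # 1-D "locations" of surveyed houses
fold = rng.integers(0, K, size=n)       # which team surveyed each house
# per-house scores, contaminated by a shared per-team "echo"
echo = rng.normal(scale=0.5, size=K)
psi = rng.normal(size=n) + echo[fold]

# Step 1: subtract each team's mean score -> removes the shared echo
fold_means = np.array([psi[fold == k].mean() for k in range(K)])
psi_c = psi - fold_means[fold]

# Step 2: HAC variance on the centered scores, downweighting distant
# pairs with a Bartlett (triangular) kernel; bandwidth h is a tuning choice
h = 1.0
dist = np.abs(coords[:, None] - coords[None, :])
w = np.clip(1.0 - dist / h, 0.0, None)  # weight 0 beyond distance h
var_hac = psi_c @ w @ psi_c / n**2

# Step 3: add back the between-team (ANOVA-style) variation
var_between = fold_means.var(ddof=1) / K
se = np.sqrt(var_hac + var_between)
print(se)
```

Step 1 kills the shared echo, Step 2 measures only the genuine neighbor-to-neighbor dependence that survives centering, and Step 3 restores the legitimate variability between teams that centering threw away.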
The Result: You get a confidence interval that is just right. It's not too wide (conservative) and not too narrow (risky). It correctly separates "we all learned from the same teacher" from "we all live in the same rich neighborhood."
Why This Matters in the Real World
This isn't just about math; it's about making better decisions in critical fields:
- Global Health: Estimating malaria rates in Africa. You have satellite images (predictions) but only a few ground tests. If you don't fix the "clumping" of tests, you might think you know the risk level when you actually don't.
- Land Use: Counting how many trees were cut down in the Amazon. Remote sensing gives a full picture, but ground verification is sparse and biased toward accessible roads.
- Census Data: Estimating income or life expectancy in specific neighborhoods.
The Takeaway
The authors built a statistical "noise-cancelling" tool.
When you have a mix of AI predictions and real-world data that is biased and clumped together, old tools get confused. They either panic and give you useless wide ranges, or they get overconfident and give you dangerous narrow ranges.
This new method acts like a filter:
- It removes the "groupthink" noise caused by how we trained the AI.
- It keeps the "real world" patterns caused by geography.
- It gives you a reliable answer with an honest margin of error.
In short: It helps us trust our predictions even when the data is messy, biased, and clumped together.