Variable Selection for Linear Regression Imputation in Surveys

This paper addresses the underexplored challenge of variable selection for linear regression imputation in survey data by defining an optimal model via an oracle loss function, analyzing the consequences of model misspecification, and proposing a methodological framework for constructing asymptotically valid and optimal confidence intervals.

Ziming An, Mehdi Dagdoug, David Haziza

Published 2026-03-06

Imagine you are a chef trying to recreate a famous soup recipe for a large banquet. You have a list of ingredients (the population), but when you go to the kitchen to gather them, you realize some of your helpers (the sample) forgot to bring certain items (missing data).

If you just ignore the missing ingredients, your soup will taste wrong (biased). To fix this, you decide to guess what those missing ingredients should be based on the ones you do have. This guessing process is called imputation.
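Concretely, regression imputation fits a model on the complete cases and plugs in predictions for the holes. A minimal sketch in Python (the data, variable names, and numbers here are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)                     # auxiliary variable, observed for everyone
y = 2.0 + 3.0 * x + rng.normal(size=n)     # study variable, missing for some
observed = rng.random(n) < 0.7             # roughly 70% of helpers showed up

# Fit y ~ x on the respondents only
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)

# Fill each hole with its regression prediction
y_imputed = np.where(observed, y, X @ beta)
print(f"imputed estimate of the mean: {y_imputed.mean():.3f}")
```

Using predictions rather than, say, the respondents' average lets each missing value borrow strength from whatever we do know about that unit.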

However, there's a catch: Which ingredients should you use to make your guess?

  • Should you guess the missing salt based only on the pepper you do have?
  • Or should you combine the pepper, the garlic, and the onion you have on hand?
  • What if you include a random spice that has nothing to do with the soup, just because you have it in your pantry?

This is the problem the paper tackles: Variable Selection for Imputation. It asks, "How do we pick the perfect set of clues to fill in the missing blanks in a survey?"

Here is the breakdown of their solution, using our kitchen analogy:

1. The "Oracle" (The Perfect Chef)

The authors imagine a magical "Oracle" chef who knows the true recipe perfectly. This Oracle knows exactly which ingredients matter and which don't. If the Oracle fills in the missing values, the resulting soup is perfect.

The paper proves that there is a mathematical way to find the "best" set of clues (variables) that gets us as close to this Oracle as possible. They call this the Optimal Imputation Model.

2. The Trap of "Too Few" vs. "Too Many"

The paper explores two common mistakes chefs make:

  • The "Too Few" Mistake (Underfitting): You try to guess the missing salt using only the pepper. If pepper alone doesn't actually predict how much salt the recipe needs, your guess will be systematically off, and the soup will be too salty or too bland. In survey terms, this leads to biased results (the wrong answer).
  • The "Too Many" Mistake (Overfitting): You try to guess the missing salt using pepper, garlic, onion, a random rock, and a shoe. While your guess might be technically "correct" on average, it becomes very unstable. If you change the rock to a different rock, your guess changes wildly. In survey terms, this increases the variance (your answer is shaky and unreliable).

The paper shows that the "Goldilocks" zone is finding the model that includes all the relevant clues but none of the irrelevant junk.
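Both failure modes show up in a small Monte Carlo. In the toy setup below (invented for illustration, not the paper's simulation design), missingness depends on x2, so an imputation model that drops x2 is biased, while one that adds a pile of junk predictors stays unbiased but noisier:

```python
import numpy as np

rng = np.random.default_rng(1)

def imputed_mean(model, rng, n=500, n_junk=10):
    """One survey draw: impute the nonrespondents, return the imputed mean of y."""
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    junk = rng.normal(size=(n, n_junk))              # the rocks and shoes
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
    obs = rng.random(n) < 1 / (1 + np.exp(-x2))      # response depends on x2

    cols = {"underfit": [x1],                        # drops the relevant x2
            "correct":  [x1, x2],
            "overfit":  [x1, x2, *junk.T]}[model]
    X = np.column_stack([np.ones(n), *cols])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    return np.where(obs, y, X @ beta).mean()

results = {}
for model in ("underfit", "correct", "overfit"):
    draws = [imputed_mean(model, rng) for _ in range(300)]
    results[model] = (np.mean(draws) - 1.0, np.std(draws))   # true mean of y is 1.0
    print(f"{model:9s} bias={results[model][0]:+.3f}  sd={results[model][1]:.3f}")
```

The underfit model is badly biased because the respondents it learns from are not representative in x2; the overfit model's bias stays near zero, but its extra wobble grows with the number of junk predictors.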

3. The Magic Tool: BIC (The Smart Filter)

The authors tested several tools to help the chef pick the right ingredients. They found that a specific tool called BIC (Bayesian Information Criterion) acts like a smart filter.

  • AIC (another tool) tends to be greedy: it happily keeps almost everything, leading to the "Too Many" mistake.
  • Cross-validation behaves similarly, often holding on to too many ingredients.
  • BIC, however, is strict. It penalizes you for adding unnecessary ingredients. The paper proves that as your sample size gets bigger, BIC will almost always find the exact right set of ingredients (the "True Model").
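In formula form, for a linear model with k fitted parameters and residual sum of squares RSS, BIC = n·log(RSS/n) + k·log(n); the log(n) term is the penalty that makes it strict. A small sketch of best-subset selection by BIC (simulated data for illustration; in a real survey you would compute this on the respondents):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 6
X = rng.normal(size=(n, p))                                    # candidates x0..x5
y = 1.0 + 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)   # only x0, x1 matter

def bic(subset):
    """BIC = n*log(RSS/n) + k*log(n) for the model using `subset` + intercept."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + Xs.shape[1] * np.log(n)

# Enumerate all 2^6 subsets and keep the one with the smallest BIC
subsets = [s for k in range(p + 1) for s in itertools.combinations(range(p), k)]
best = min(subsets, key=bic)
print("BIC selects predictors:", best)
```

Because the penalty grows with n while the evidence for a truly irrelevant variable does not, the chance of keeping junk shrinks as the sample grows, which is exactly the consistency property the paper relies on.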

4. The "Magic" Result: You Don't Need to Worry

Here is the most exciting part of the paper. Usually, when you use a computer to pick the "best" model, statisticians get nervous. They worry that because the data itself drove the choice, the usual math behind confidence intervals no longer holds (the post-selection inference problem).

The authors prove that if you use a smart tool like BIC, you can pretend you knew the true recipe all along!

  • You pick the model.
  • You calculate your average.
  • You calculate your margin of error.

The math shows that the uncertainty introduced by choosing the model disappears as the data gets larger. You get the same perfect results as if the Oracle had told you the recipe from the start. This is called Oracle Efficiency.
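Oracle efficiency is easy to probe empirically. In the toy Monte Carlo below (again invented for illustration, not the paper's design), the estimator that picks its model with BIC ends up with essentially the same distribution as the oracle estimator that is told the true predictors:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 6

def one_draw(rng):
    """Return (oracle estimate, BIC-selected estimate) for one sample."""
    X = rng.normal(size=(n, p))
    y = 1.0 + 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)
    obs = rng.random(n) < 1 / (1 + np.exp(-X[:, 1]))    # missingness depends on x1

    def fit_impute(subset):
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta, *_ = np.linalg.lstsq(Xs[obs], y[obs], rcond=None)
        return np.where(obs, y, Xs @ beta).mean()

    def bic(subset):
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta, *_ = np.linalg.lstsq(Xs[obs], y[obs], rcond=None)
        rss = np.sum((y[obs] - Xs[obs] @ beta) ** 2)
        m = obs.sum()
        return m * np.log(rss / m) + Xs.shape[1] * np.log(m)

    subsets = [s for k in range(p + 1) for s in itertools.combinations(range(p), k)]
    best = min(subsets, key=bic)
    return fit_impute((0, 1)), fit_impute(best)

draws = np.array([one_draw(rng) for _ in range(200)])
oracle, selected = draws[:, 0], draws[:, 1]
print(f"oracle   mean={oracle.mean():.3f} sd={oracle.std():.3f}")
print(f"selected mean={selected.mean():.3f} sd={selected.std():.3f}")
```

In most replicates BIC picks exactly the true subset, so the two estimators literally coincide; in the rest the difference is negligible, which is the finite-sample face of "the selection uncertainty disappears."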

5. The Simulation (The Taste Test)

To prove this works, they ran thousands of computer simulations (like running the soup recipe 20,000 times with different random missing ingredients).

  • They confirmed that the "smart filter" (BIC) consistently picked the right ingredients.
  • They confirmed that the final soup (the survey estimate) tasted exactly right (unbiased).
  • They confirmed that the "margin of error" they calculated was accurate (the confidence intervals were correct).

The Bottom Line

In the world of surveys, missing data is a huge headache. This paper gives survey statisticians a clear, mathematically proven rulebook:

  1. Don't guess randomly.
  2. Don't include every variable you have.
  3. Use a rigorous selection tool (like BIC) to find the "True Model."
  4. Once you do that, you can trust your results completely, knowing they are as accurate as if you had perfect information to begin with.

It turns a messy, guesswork-heavy process into a precise, reliable science.