
Bias and Variance of Adjusting for Instruments

This paper's simulation demonstrates that, within the framework of large-scale propensity score adjustment, including instruments whose correlation with treatment is below 0.5 and that leave preference-score equipoise above 0.5 introduces only minor bias, supporting the strategy of adjusting for many covariates rather than attempting to identify a limited set of confounders.

Original authors: Hripcsak, G., Anand, T., Chen, H. Y., Zhang, L., Chen, Y., Suchard, M. A., Ryan, P. B., Schuemie, M. J.

Published 2026-03-15

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: Does Drug A actually help patients recover, or does it just look that way because of other factors?

In the real world, we can't run perfect experiments where we randomly assign people to take the drug or a placebo (that would be unethical or impossible for many conditions). Instead, we look at "observational data"—records of what people actually did.

The problem is Confounding.

The Scenario: Maybe people who take Drug A are generally healthier to begin with. If they recover faster, is it the drug, or was it their good health?
The Solution (Propensity Score): To fix this, statisticians use a "matchmaker" tool called a Propensity Score. It looks at hundreds of details about a patient (age, weight, other meds, past history) and estimates how likely that patient was to receive the drug. Patients with similar scores, one who took the drug and one who didn't, can then be paired up. By comparing these matched groups, we hope to isolate the drug's true effect.

The Big Debate: "More is Better" vs. "Less is More"

For decades, researchers have argued about which details to feed into this matchmaker tool:

The "Pick and Choose" Team: "Only include the obvious suspects (confounders). If we include too many variables, we might mess things up."
The "Throw Everything In" Team (LSPS): "Include every piece of data we have before the treatment started. Let the computer figure out what matters."

The Fear: The "Pick and Choose" team worries about Instruments.
An Instrument is a sneaky variable. It influences whether someone gets the drug, but it has no effect of its own on whether they get better or worse.

Analogy: Imagine a doctor who always prescribes Drug A to patients who live in a specific zip code. The zip code is an "instrument." It predicts the drug, but living in that zip code doesn't make you healthier.
The Worry: If you accidentally include the "zip code" in your matchmaker tool, you might distort the results, making the drug look worse or better than it really is.

What This Paper Did: The "Tug-of-War" Simulation

The authors (a team of medical data scientists) ran a massive computer simulation to settle this debate. They wanted to see: If we accidentally include a "sneaky" instrument in our matchmaker tool, how much does it actually hurt our results?

They set up a scenario with:

A Real Confounder: A factor that messes up the results (like "good health").
A Sneaky Instrument: A factor that only predicts the drug (like the "zip code").
The Test: They ran the simulation repeatedly, making the "zip code" influence the drug choice more and more strongly, and watched what happened to the final answer.

The Surprising Discovery

The results were counter-intuitive but very reassuring:

1. The "Noise" vs. The "Signal"
Even when the "sneaky instrument" contributed 20 times as much variance to the treatment decision as the unadjusted "confounder" did, including it in the model did not ruin the answer: the extra bias it added was smaller than the bias the confounder was already causing.

Analogy: Imagine you are trying to hear a whisper (the drug's true effect) in a noisy room.
- The Confounding is a loud, distracting shout that drowns out the whisper.
- The Instrument is a static hiss in the background.
- The old fear was: "If we add a microphone to filter out the static, we might accidentally amplify the shout!"
- The Reality: The authors found that even if the static hiss is incredibly loud, turning on the microphone (adjusting for the instrument) only adds a tiny bit of extra noise. It doesn't drown out the whisper nearly as much as the original shout (the unadjusted confounding) does.

2. The Safety Net (Diagnostics)
The paper also looked at the safety rules used by the "Throw Everything In" team (LSPS). They have two "stop signs":

The Correlation Check: If a variable is too strongly linked to the drug (like a correlation of 0.5 or higher), stop and check it.
The Equipoise Check: This measures how much the two groups overlap, that is, how many patients could plausibly have received either treatment. If the "matchmaker" is struggling to find matches, it's a red flag.

The simulation showed that as long as these safety checks are in place, the "sneaky instruments" that slip through are too weak to cause significant damage.

The Bottom Line

Don't be afraid to cast a wide net.

The study concludes that in the real world, it is much more dangerous to miss a real confounder (by trying to be too picky) than it is to accidentally include a weak instrument.

The Old Way: Trying to manually pick the "perfect" list of variables is like trying to find a needle in a haystack by only looking at the top layer. You might miss the needle.
The New Way (LSPS): Dumping the whole haystack into a sieve (using all data) and letting the computer filter it is safer. Even if a few pieces of straw (instruments) get through, they don't spoil the search. The "safety checks" (correlation and equipoise) ensure that the really bad straw gets caught.

In short: When trying to figure out if a treatment works, it's better to be inclusive and let the data speak, rather than being overly cautious and accidentally ignoring the factors that actually matter. The "noise" of instruments is manageable; the "silence" of missing confounders is fatal to the study.

1. Problem Statement

In observational research, propensity score (PS) adjustment is a standard method for addressing confounding. However, there is a longstanding debate regarding covariate selection:

The Dilemma: Should researchers include all pre-treatment covariates (to ensure no confounders are missed) or carefully select a limited set of known confounders?
The Risk of Instruments: A primary concern with including all covariates is the inadvertent inclusion of instruments. An instrument is a variable that influences treatment assignment but affects the outcome only through the treatment.
Theoretical Consequence: Theoretical literature suggests that adjusting for an instrument in the presence of unadjusted confounding can:
1. Amplify Bias: Increase the bias of the effect estimate (bias amplification).
2. Increase Variance: Widen the confidence intervals of the estimate.
The Gap: While Large-Scale Propensity Score (LSPS) methods include vast numbers of covariates and employ diagnostics to filter instruments, how much bias and variance "weak" instruments (those that pass diagnostics) actually introduce remains poorly quantified.
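For intuition only, bias amplification has a simple closed form in a linear setting. This is a simplification and not the paper's propensity score analysis. Assume (hypothetically) a linear outcome model $Y = E\,T + D\,X + \varepsilon$, where $D$ is taken to be the confounder's effect on the outcome (an assumption about the notation), the instrument $Z$ is independent of the confounder $X$, and $r_{ZT}$ is the correlation between $Z$ and $T$. Regressing $Y$ on $T$ alone, versus on $T$ and $Z$, gives:

$$
\text{bias}_{crude} = D\,\frac{\operatorname{Cov}(X,T)}{\operatorname{Var}(T)},
\qquad
\text{bias}_{instr} = D\,\frac{\operatorname{Cov}(X,T)}{\operatorname{Var}(T)\,\bigl(1 - r_{ZT}^{2}\bigr)} = \frac{\text{bias}_{crude}}{1 - r_{ZT}^{2}}
$$

Adjusting for $Z$ removes the share $r_{ZT}^{2}$ of the treatment variance but none of the covariation between $X$ and $T$, so the same confounding is divided by a smaller denominator. At $r_{ZT} = 0.5$ the factor is $1/(1 - 0.25) \approx 1.33$ under this linear approximation. Propensity-score-based estimates, such as those in the paper, need not follow this formula exactly.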

2. Methodology

The authors conducted a simulation study to quantify the bias and variance introduced by adjusting for instruments under conditions of unadjusted confounding.

Simulation Design:
- Variables: Defined a confounder ($X$), a measured instrument ($Z$), an unmeasured instrument ($U$), treatment ($T$), and outcome ($Y$).
- Parameters:
  - Treatment effect ($E$) was fixed at 0.5.
  - Confounding strength ($C$, $D$) was fixed at 1.
  - Instrument strength ($B$) was varied from 1 to 7.
  - To keep total treatment variance constant, the strength of the unmeasured instrument ($R$) was adjusted inversely as $B$ increased.
- Models Tested:
  1. Crude ($M_{crude}$): No covariates.
  2. Instrument Only ($M_{instr}$): Adjusts for $Z$ only (illustrates bias amplification).
  3. Confounder Only ($M_{conf}$): Adjusts for $X$ only (the "ground truth" for bias correction).
  4. Confounder + Instrument ($M_{conf-instr}$): Adjusts for both.
- Sample Size: 200,000 observations per simulation.
- Diagnostics: Calculated the Pearson correlation between the instrument and treatment, and the equipoise (preference score) to mirror LSPS diagnostics.
Scenarios:
1. Single Instrument: Varied strength of one instrument.
2. Multiple Instruments: Repeated the analysis with 10 independent instruments to test the aggregate effect.
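The four-model comparison can be sketched as a toy simulation. This is not the authors' code, and several choices are assumptions made for illustration: the logistic treatment model, the rule `B**2 + R**2 = total_var` for keeping treatment variance constant, the reading of $C$ as the confounder's effect on treatment and $D$ as its effect on the outcome, and the use of ordinary least squares on the outcome in place of propensity score adjustment. The sizes of the biases it prints are therefore not comparable to the paper's reported figures.

```python
import numpy as np

def simulate(n, B, total_var=50.0, C=1.0, D=1.0, E=0.5, seed=0):
    """Draw one data set. total_var (= B**2 + R**2) is a hypothetical constant."""
    rng = np.random.default_rng(seed)
    # Confounder X, measured instrument Z, unmeasured instrument U.
    X, Z, U = rng.standard_normal((3, n))
    # Assumed rule: the unmeasured instrument weakens as B grows.
    R = np.sqrt(total_var - B**2)
    p_treat = 1.0 / (1.0 + np.exp(-(B * Z + R * U + C * X)))
    T = (rng.random(n) < p_treat).astype(float)
    # Instruments reach Y only through T; X affects Y directly.
    Y = E * T + D * X + rng.standard_normal(n)
    return X, Z, T, Y

def effect_estimate(Y, T, covariates=()):
    """Coefficient on T from an OLS fit of Y on T plus the given covariates."""
    design = np.column_stack([np.ones_like(T), T, *covariates])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[1]

def run_models(n=200_000, B=3.0, seed=0):
    X, Z, T, Y = simulate(n, B, seed=seed)
    return {
        "crude": effect_estimate(Y, T),               # no covariates
        "instr": effect_estimate(Y, T, [Z]),          # instrument only
        "conf": effect_estimate(Y, T, [X]),           # confounder only
        "conf_instr": effect_estimate(Y, T, [X, Z]),  # both
        "corr_ZT": np.corrcoef(Z, T)[0, 1],           # correlation diagnostic
    }

if __name__ == "__main__":
    for B in range(1, 8):
        res = run_models(B=float(B), seed=B)
        print(B, {name: round(float(value), 3) for name, value in res.items()})
```

In this sketch the two models that adjust for $X$ should land near the true effect of 0.5, while the instrument-only model should sit further from 0.5 than the crude model, with the gap widening as the instrument-treatment correlation rises.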

3. Key Contributions

Quantification of Bias Amplification: The study quantifies, by simulation, how much bias is added when adjusting for instruments that pass standard LSPS diagnostics (correlation < 0.5).
Validation of LSPS Diagnostics: It tests the efficacy of the LSPS rejection criteria (correlation $\ge$ 0.5 and equipoise thresholds) in preventing significant bias amplification.
Comparison of Strategies: It directly compares the "all-covariates" approach against the "careful selection" approach in a controlled simulation environment.

4. Key Results

Bias Amplification is Limited:
- Even when the variance contributed by the adjusted instrument was 20 times greater than the variance from the unadjusted confounder, the additional bias introduced by adjusting for the instrument was less than the bias caused by the unadjusted confounding itself.
- At the LSPS diagnostic threshold (correlation $\approx$ 0.5, equipoise $\approx$ 0.5), adjusting for the instrument increased the effect estimate bias by only ~50% relative to the existing confounding bias.
Variance Impact:
- Adjusting for an instrument increased variance, but the increase was modest (less than 50% increase over the base variance at the diagnostic threshold).
- The added variance was not large enough to render the estimate unusable.
Multiple Instruments:
- When simulating 10 instruments, the bias never exceeded the bias caused by the unadjusted confounder, provided the equipoise remained above 0.5 (or even down to 0.475).
- The equipoise metric proved effective at detecting the aggregate effect of multiple weak instruments, even when individual correlations were low (e.g., 0.153).
Ground Truth: The model adjusting for only the confounder produced the correct estimate (0.5), confirming that the simulation setup was valid.
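The two diagnostics can be illustrated with a short sketch. It assumes the convention common in OHDSI tooling, where the preference score is the propensity score rescaled by treatment prevalence on the logit scale and equipoise is the fraction of patients whose preference score lies between 0.3 and 0.7. Whether the paper uses exactly these bounds is an assumption. The function names are hypothetical, and the true propensity score from the simulated model is used in place of a fitted one.

```python
import numpy as np

def preference_score(ps, treated):
    """Propensity score rescaled by the overall treated fraction (logit scale)."""
    ps = np.clip(ps, 1e-12, 1 - 1e-12)  # guard against log(0)
    prevalence = treated.mean()
    logit = np.log(ps / (1 - ps)) - np.log(prevalence / (1 - prevalence))
    return 1.0 / (1.0 + np.exp(-logit))

def equipoise(ps, treated, lower=0.3, upper=0.7):
    """Fraction of patients whose preference score falls in [lower, upper]."""
    pref = preference_score(ps, treated)
    return float(np.mean((pref >= lower) & (pref <= upper)))

def diagnostics(n_instruments, strength, n=100_000, seed=0):
    """Largest instrument-treatment correlation and equipoise for a toy model."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, n_instruments))  # independent instruments
    ps = 1.0 / (1.0 + np.exp(-strength * Z.sum(axis=1)))  # true propensity
    treated = (rng.random(n) < ps).astype(float)
    corrs = [np.corrcoef(Z[:, j], treated)[0, 1] for j in range(n_instruments)]
    return max(abs(c) for c in corrs), equipoise(ps, treated)

if __name__ == "__main__":
    for k in (1, 10):
        max_corr, eq = diagnostics(k, strength=0.5)
        print(f"{k} instrument(s): max |corr| = {max_corr:.3f}, equipoise = {eq:.3f}")
```

With these toy settings, ten weak instruments each stay well below the 0.5 correlation cutoff yet together pull equipoise below 0.5, which is the kind of aggregate effect the equipoise check is meant to catch.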

5. Significance and Conclusion

Support for Large-Scale Propensity Scores (LSPS): The findings strongly support including large numbers of covariates over manual selection. The risk of including mild-to-moderate instruments is outweighed by the benefit of capturing confounders that manual selection would miss.
Diagnostic Thresholds are Effective: The LSPS diagnostics (rejecting if correlation $\ge$ 0.5 or equipoise falls below 0.5) are sufficient to prevent significant bias amplification. The authors suggest the equipoise threshold of 0.5 might even be conservative.
Practical Implication: Researchers should not fear including broad sets of covariates in propensity models. The "quest to adjust for confounding supersedes the risk of adjusting for an instrument," provided standard diagnostics are applied.
Limitation: The study is a simulation. While it aligns with previous empirical studies, the authors acknowledge that real-world data validation is an ongoing necessity.

Final Takeaway: In the context of LSPS, adjusting for instruments that pass standard correlation and equipoise diagnostics introduces only a minor amount of bias and variance, making the "include-all-covariates" approach superior to attempting to manually identify a limited set of perfect confounders.