Assessing Sensitivity to IV Exclusion and Exogeneity without First Stage Monotonicity

This paper introduces new sensitivity analysis methods for instrumental variable studies. It derives identified sets for treatment effects under relaxed exclusion and exogeneity assumptions, without requiring first-stage monotonicity, using computationally tractable linear programming, and demonstrates the approach in an empirical application to peer effects in movie viewership.

Paul Diegert, Matthew A. Masten, Alexandre Poirier

Published 2026-04-10

Imagine you are a detective trying to solve a mystery: Does watching a movie with friends actually make you want to watch it again later?

To solve this, you can't just ask people what they did, because maybe they just really liked the movie to begin with. You need a clever trick. You decide to use the weather as your clue.

  • The Logic: If it's a beautiful, sunny Saturday, people stay outside and skip the movies. If it's rainy, they look for something to do indoors, so they do go.
  • The Assumption: You assume the weather is a "random" event that only affects whether people go to the movies, and nothing else. It doesn't change the quality of the movie, and it doesn't change how much people like talking about it.

This is the standard way economists use Instrumental Variables (IV). But here's the problem: What if your clue is flawed?

Maybe on sunny days, people also go to outdoor festivals where they talk about movies, which changes their opinion. Or maybe movie studios release better movies on sunny days to compete with the weather. If the weather affects the outcome directly (not just through the movie ticket sales), your whole investigation is ruined.

The Paper's Big Idea: "What If We're Wrong?"

This paper by Diegert, Masten, and Poirier is like a stress test for your detective work.

Most previous methods for checking these clues relied on a strict rule called "Monotonicity." Think of this as a rule saying: "The weather must always push people in the same direction. A rainy day can never make a movie night more likely for some people and less likely for others."

The authors say: "That's too strict! Real life is messy."

They developed a new toolkit that says: *"We don't need to assume the weather always pushes people the same way. We just want to know: How much can our clue be slightly wrong before our whole conclusion falls apart?"*

The New Toolkit: The "Sensitivity Slider"

Imagine you have a slider on a control panel labeled "How much do we trust the weather?"

  1. Slider at 0 (Perfect Trust): You assume the weather is a perfect, random clue. The paper calculates the answer based on this.
  2. Slider at 1 (No Trust): You place no restrictions at all on how the weather relates to the outcome. The answer becomes a huge, useless range (like "The effect is between -100% and +100%").
  3. The Middle Ground: The authors created a way to slide the knob to 0.1, 0.2, 0.5, etc. At each step, they ask: "If the weather is slightly 'contaminated' (e.g., it affects the outcome through some channel other than movie attendance, for up to 5% of people), what happens to our answer?"

They use a mathematical method called Linear Programming (an optimization technique that finds the largest and smallest values of a quantity subject to linear constraints) to find the "best case" and "worst case" scenarios for every position of that slider.
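To make this concrete, here is a minimal toy sketch of the idea in Python, not the authors' actual estimator. It assumes a binary instrument, treatment, and outcome, uses made-up observed probabilities, and relaxes only the exclusion restriction: the slider c caps the share of people whose outcome responds to the instrument directly. All names and numbers below are illustrative.

```python
# A toy sketch, not the paper's estimator. Latent "types" record a person's
# full response profile: (D(z=0), D(z=1), Y(d=0,z=0), Y(d=0,z=1),
# Y(d=1,z=0), Y(d=1,z=1)) -- 2^6 = 64 types in total.
import itertools
import numpy as np
from scipy.optimize import linprog

types = list(itertools.product([0, 1], repeat=6))

# Made-up observed probabilities P(Y=y, D=d | Z=z), for illustration only.
p_obs = {  # keys: (y, d, z)
    (0, 0, 0): 0.30, (1, 0, 0): 0.25, (0, 1, 0): 0.20, (1, 1, 0): 0.25,
    (0, 0, 1): 0.20, (1, 0, 1): 0.15, (0, 1, 1): 0.25, (1, 1, 1): 0.40,
}

def ate_bounds(c):
    """Min and max average treatment effect when at most a share `c` of
    people can have the instrument affect their outcome directly."""
    n = len(types)
    # Equalities: the type distribution must reproduce the observed data.
    A_eq, b_eq = [], []
    for (y, d, z), prob in p_obs.items():
        row = np.zeros(n)
        for j, (d0, d1, y00, y01, y10, y11) in enumerate(types):
            d_z = (d0, d1)[z]                        # treatment taken when Z=z
            y_dz = ((y00, y01), (y10, y11))[d_z][z]  # outcome realized
            if d_z == d and y_dz == y:
                row[j] = 1.0
        A_eq.append(row)
        b_eq.append(prob)
    A_eq.append(np.ones(n))  # probabilities sum to one
    b_eq.append(1.0)

    # The slider: mass of exclusion-violating types (Y depends on Z) <= c.
    viol = np.array([(y00 != y01) or (y10 != y11)
                     for (_, _, y00, y01, y10, y11) in types], dtype=float)

    # Target: the ATE holding the instrument at z=0, E[Y(1,0) - Y(0,0)].
    obj = np.array([y10 - y00 for (_, _, y00, _, y10, _) in types], dtype=float)

    out = []
    for sign in (1, -1):  # minimize the ATE, then maximize it
        res = linprog(sign * obj, A_ub=[viol], b_ub=[c],
                      A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
        out.append(sign * res.fun)
    return tuple(out)  # (lower bound, upper bound)

for c in [0.00, 0.05, 0.10, 0.20]:
    lo, hi = ate_bounds(c)
    print(f"slider c = {c:.2f}: ATE lies in [{lo:+.3f}, {hi:+.3f}]")
```

Each call to linprog finds one edge of the identified set; sweeping c traces out how quickly the bounds widen as trust in the instrument erodes.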

The Real-World Test: Movie Tickets

The authors tested their new toolkit on a famous study about movie peer effects.

  • The Original Study: Found that surprise swings in a movie's opening-weekend audience (driven by the weather) carry over into later weekends: when the weather boosts opening attendance, later sales rise too, because more people talk about the movie. It concluded: "Yes, peer effects work!"
  • The New Stress Test: The authors took that study and ran their "Sensitivity Slider."

The Shocking Result:
When they assumed the weather was a perfect instrument, the result held up. But the moment they allowed even a small violation of the assumptions (a sensitivity value of just 1.5%), the result collapsed.

The "positive effect" disappeared. The answer became: "We don't know if peer pressure works or not; the data is consistent with zero effect."

Why This Matters

Think of this paper as a safety net for science.

In the past, if a study said "X causes Y," we often had to take it on faith that its assumptions were perfect. If the assumptions were slightly wrong, the whole conclusion could be wrong too, and nobody would know until years later.

This paper gives researchers a way to say:

"Here is our conclusion. But look at this graph. If our assumption is wrong by just a tiny bit, the conclusion changes. If it's wrong by a medium bit, the conclusion vanishes. So, you should be very careful trusting this result."

The Takeaway

This paper doesn't just give you an answer; it gives you a map of uncertainty.

It tells us that in the world of data, perfection is rare. Instead of pretending our clues are perfect, we should measure exactly how fragile our conclusions are. If a conclusion breaks with the slightest nudge, it's not a solid fact—it's a house of cards.

In short: The authors built a machine that tells you how much "wiggle room" you have in your assumptions before your scientific discovery turns into a guess. And in the case of movie peer effects, it turns out the wiggle room was almost non-existent.
