Post-Hoc Large-Sample Statistical Inference

This paper develops a theory of asymptotic post-hoc inference that uses e-values to construct sharper confidence sets and p-values under weaker assumptions than existing nonasymptotic methods, allowing valid statistical conclusions even when significance levels are chosen after observing the data.

Ben Chugg, Etienne Gauthier, Michael I. Jordan, Aaditya Ramdas, Ian Waudby-Smith

Published Tue, 10 Ma

Imagine you are a detective trying to solve a mystery using a pile of clues (data). In the world of statistics, your goal is to draw a "net" (a confidence interval) around the true answer (the parameter) to catch it.

The Old Problem: The "Rigid Rulebook"

For decades, statisticians had a strict rulebook: You must decide how "tight" your net needs to be before you even look at the clues.

This "tightness" is called the significance level (usually denoted α, like 0.05 or 5%).

  • The Scenario: You decide to use a 5% net. You look at the data, and the net you catch is huge and fuzzy. It's useless! You can't say anything meaningful.
  • The Dilemma: You want to try a "looser" net (say, 10% or 20%) to get a clearer picture. But the old rules say, "No! You picked 5% at the start. If you change it now based on what you see, you are cheating."
  • The Consequence: If you change the rules mid-game, your statistical guarantees vanish. It's like a judge saying, "I'll only accept evidence if you promise the verdict before the trial starts." If you change your mind after seeing the evidence, the verdict is invalid.

This is called the "Roving Alpha" problem. It forces analysts to either stick with bad results or cheat to get better ones.

The New Solution: The "Magic Scorecard" (E-Values)

This paper introduces a new way to do detective work using something called E-values (Evidence values). Think of an E-value as a magic scorecard that accumulates evidence against a suspect (a hypothesis).

  • The Magic Trick: With this scorecard, you don't need to decide how strict you are before the trial. You can look at the clues, see how strong the evidence is, and then decide, "Okay, I'm willing to accept a 5% risk," or "Actually, I'll accept a 20% risk."
  • Why it works: The math behind E-values is designed so that even if you change your mind about the rules after seeing the data, the scorecard still holds up. It's like a game where the score is calculated in a way that prevents you from "gaming the system" by changing the rules mid-play.
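The "magic" above is just Markov's inequality. An e-value is a nonnegative number whose expected value is at most 1 when the hypothesis is true, so P(E ≥ 1/α) ≤ α holds for every α simultaneously, and you are free to pick α after looking at E. Here is a minimal Python sketch (not from the paper; the N(0,1)-vs-N(1,1) likelihood ratio is a hypothetical choice of e-value used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def e_value(x):
    # Likelihood-ratio e-value: N(1,1) vs. the null N(0,1).
    # Under the null its expectation is exactly 1.
    return float(np.exp(np.sum(x * 1.0 - 0.5)))  # sum(x*mu - mu^2/2), mu = 1

n_trials, n_obs = 100_000, 5
es = np.array([e_value(rng.normal(0.0, 1.0, n_obs)) for _ in range(n_trials)])

# The post-hoc p-value p = 1/e is valid at EVERY alpha at once (Markov):
for alpha in (0.01, 0.05, 0.2):
    rate = float(np.mean(1.0 / es <= alpha))  # false-rejection rate at level alpha
    print(f"alpha={alpha}: rejection rate {rate:.4f} (guaranteed <= {alpha})")
```

All three rejection rates stay below their respective α even though no level was fixed in advance; this is exactly what a classical p-value cannot promise if you move the threshold after seeing the data.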

The Big Leap: From "Small Samples" to "Big Data"

Previously, this "Magic Scorecard" trick required very strong assumptions about the data (like "the data must be perfectly normal" or "every observation is bounded"). If the data was messy and real-world, the trick either broke down or became painfully conservative.

This paper is the breakthrough: It extends the Magic Scorecard to Large Samples (Big Data).

  1. The Old Way (Non-asymptotic): To use the scorecard on messy data, you had to assume the data was very well-behaved. If it wasn't, the scorecard was too conservative (giving you huge, useless nets) or simply didn't work.
  2. The New Way (Asymptotic): The authors developed a version of the scorecard that works when you have lots of data. As the amount of data grows to infinity, the scorecard becomes incredibly accurate.
    • The Benefit: You can now use this flexible, "post-hoc" method on real-world, messy, large datasets without needing to make impossible assumptions about the data's shape.
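The engine behind the large-sample guarantee is the central limit theorem: the studentized mean of even heavily skewed data behaves like a standard normal once the sample is large, which is the kind of limiting behavior asymptotic e-value constructions lean on. A tiny illustrative simulation (not from the paper) with exponential data:

```python
import numpy as np

rng = np.random.default_rng(1)

def studentized(n):
    # Exponential data: skewed and decidedly non-normal (true mean = 1).
    x = rng.exponential(1.0, n)
    return np.sqrt(n) * (x.mean() - 1.0) / x.std(ddof=1)

zs = np.array([studentized(2000) for _ in range(20_000)])
# Two-sided tail rate at z = 1.96 approaches the standard normal's ~5%,
# even though no normality was assumed about the raw data.
print(round(float(np.mean(np.abs(zs) > 1.96)), 3))
```

No "the data must be normal" assumption appears anywhere; normality emerges from the sample size, which is precisely the loophole the paper's asymptotic scorecards exploit.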

The Three New Tools

The paper offers three specific "nets" (confidence intervals) for different situations:

  1. The "Anchored" Net: You pick a "best guess" for how strict you'll want to be (e.g., "I think 1% is the right level") and set your scorecard based on that. Even if you later decide 5% or 10% is the level you actually want, this net stays surprisingly tight and accurate. It's like packing an umbrella sized for a drizzle: even if the weather turns out differently, it still mostly does its job.
  2. The "Mixture" Net: Instead of guessing one level, this net blends many different levels together. It's a bit wider (looser) than the anchored net, but it guarantees you won't be caught off guard if your guess was way off. It's the "safety first" option.
  3. The "Sequential" Net: This is the most powerful tool. It allows you to keep looking at data forever. You can stop whenever you want, or keep going, and the net remains valid. It's like having a net that grows and shrinks automatically as you walk through a forest, always keeping the "truth" inside, no matter how long you walk.
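To make the "sequential" net concrete, here is a classical construction in the same spirit, a sketch rather than the paper's exact recipe: Robbins' normal-mixture confidence sequence. A nonnegative martingale plays the role of the scorecard, and Ville's inequality guarantees the truth stays inside the net at every sample size simultaneously. The unit-variance data and the prior scale `rho2` are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def cs_radius(n, rho2=1.0, alpha=0.05):
    # Robbins' normal-mixture confidence-sequence radius for the mean of
    # N(theta, 1) data; valid for ALL sample sizes n at once (Ville's inequality).
    return np.sqrt(2.0 * (1.0 + n * rho2) / (n**2 * rho2)
                   * np.log(np.sqrt(1.0 + n * rho2) / alpha))

alpha, horizon, trials = 0.05, 1000, 2000
misses = 0
for _ in range(trials):
    x = rng.normal(0.0, 1.0, horizon)          # true mean is 0
    n = np.arange(1, horizon + 1)
    running_mean = np.cumsum(x) / n
    # Did the net EVER fail to contain the truth anywhere along the path?
    if np.any(np.abs(running_mean) > cs_radius(n, alpha=alpha)):
        misses += 1
print(misses / trials)   # uniform-in-time miscoverage, at most ~alpha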

Why Should You Care?

In the real world, scientists and analysts often look at data, see a vague result, and then tweak their analysis to get a clearer answer. The old rules said this was "p-hacking" (cheating).

This paper says: "You don't have to cheat to be flexible."

It gives statisticians a rigorous, mathematically proven way to:

  • Look at the data first.
  • Decide how much risk they are willing to take after seeing the results.
  • Still get a valid, trustworthy answer.

It turns statistics from a rigid, pre-planned exam into a flexible, adaptive conversation with the data, all while keeping the "truth" safely caught in the net.