This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.
Imagine you are a detective trying to solve a massive crime scene. Instead of one clue, you have 100,000 clues (hypotheses) scattered across a city. Your goal is to find out which clues actually point to the criminal (the "true discoveries") and which are just red herrings (false alarms).
In the past, detectives had to wait until they collected all the clues before they could make a final report. If they stopped early, their report might be legally invalid. If they kept collecting clues after finding the criminal, they might accidentally change their conclusion based on new, irrelevant noise.
This paper introduces a new kind of super-magnifying glass that lets you look at your clues anytime you want, stop whenever you want, and still keep a mathematically guaranteed error rate on your conclusions.
Here is the breakdown of the paper's ideas using simple analogies:
1. The Problem: The "Fixed Sample Size" Trap
Traditionally, statisticians act like a baker who must bake exactly 50 cakes before tasting any.
- The Rule: You must decide in advance how many people (or data points) you will study.
- The Flaw: What if you find the answer after 10 people? You're forced to waste time and money studying 40 more. What if you need to stop early because of a budget cut? The old math says, "Your results are invalid because you didn't finish the recipe."
2. The Solution: "Anytime-Valid" Inference
The author, Friederike Preusse, proposes a method that acts like a live-updating GPS.
- The Analogy: Imagine you are driving to a destination. A normal map tells you, "You will arrive in 30 minutes if you drive the whole way." But if you stop at a gas station, the map says, "Invalid route."
- The New GPS: This new method says, "No matter when you stop, or how long you've been driving, I can tell you exactly how close you are to the destination with a guaranteed safety margin."
- The Benefit: In expensive fields like fMRI brain scanning (where every minute costs thousands of dollars), researchers can stop the scan the moment they are confident enough, saving huge amounts of money and time, without breaking the rules of statistics.
3. The Core Mechanism: The "Safe E-Process"
How does this magic work? The paper uses something called an e-process.
- The Metaphor: Think of an e-process as a betting chip or a trust score.
- If a hypothesis is a "fake" (a false discovery), the trust score is designed to stay low.
- If a hypothesis is "real," the trust score grows rapidly as you gather more data.
- The Safety Net: The math (a result known as Ville's inequality) guarantees that even if you look at the score every single second, the chance of the score for a fake hypothesis ever rising above 1/α is at most α, the error level you chose. It's like a casino that guarantees it will never go bankrupt, even if you check the chips every second.
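The "trust score" idea can be sketched in a few lines of Python. This is an illustrative toy, not the paper's construction: the e-process here is a running product of likelihood ratios for a coin that may be biased, and the coin model, `p_alt`, and `p_null` are all assumptions of this sketch.

```python
import random

def e_process(flips, p_alt=0.7, p_null=0.5):
    """Trust score for the hypothesis "this coin is fair": a running
    product of likelihood ratios (alternative vs. null). Under the null,
    its expected value never exceeds 1, which makes it an e-process."""
    e, trajectory = 1.0, []
    for heads in flips:
        e *= (p_alt if heads else 1 - p_alt) / (p_null if heads else 1 - p_null)
        trajectory.append(e)
    return trajectory

random.seed(0)
fair = [random.random() < 0.5 for _ in range(1000)]    # a "fake" hypothesis
biased = [random.random() < 0.7 for _ in range(1000)]  # a "real" effect

e_fair = e_process(fair)
e_biased = e_process(biased)
# Ville's inequality: under the null, the chance that the score EVER
# exceeds 1/alpha is at most alpha -- no matter how often you peek.
```

With these settings the score for the fair coin typically collapses toward zero while the score for the biased coin explodes, which is exactly the "stays low for fakes, grows for real effects" behavior described above.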
4. The "Closed Testing" Framework: The Team of Judges
To handle 100,000 clues at once, the method uses a system called Closed Testing.
- The Analogy: Imagine you have a team of judges. To reject a "group" of clues (say, "all clues in the kitchen"), you don't just look at the kitchen. You have to prove that every possible combination of clues in the kitchen is guilty.
- The Challenge: With 100,000 clues, there are more combinations than atoms in the universe. Checking them all would take forever.
- The Shortcut: The author found a computational shortcut. Instead of checking every single combination, the method sorts the clues by their "trust scores" and only checks the most suspicious ones. It's like a detective who knows that if the top 5 suspects are innocent, the whole group is innocent, so they don't need to interrogate the bottom 99,995 suspects individually.
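The sorting shortcut can be illustrated with a toy version of closed testing over e-values. Everything here is an assumption of the sketch, not the paper's exact procedure: the local test "reject a subset if its average e-value is at least 1/α" (the average of e-values is itself a valid e-value), plus the observation that the hardest subsets to reject are the ones with the smallest e-values, so only prefixes of the sorted list need checking instead of all 2^m subsets.

```python
def true_discovery_lower_bound(e_values, alpha=0.05):
    """Toy closed-testing shortcut: lower-bound the number of true
    discoveries among m hypotheses, each carrying an e-value.

    Local test (assumed for this sketch): reject a subset S iff the mean
    e-value in S is at least 1/alpha. Since adding a larger e-value can
    only raise a prefix mean, the k hypotheses hardest to reject are the
    k with the smallest e-values -- so we scan sorted prefixes only."""
    e = sorted(e_values)          # ascending: least suspicious first
    m = len(e)
    threshold = 1.0 / alpha
    prefix_sum = 0.0
    survivors = 0                 # size of the largest non-rejected subset
    for k in range(1, m + 1):
        prefix_sum += e[k - 1]
        if prefix_sum / k < threshold:   # k smallest could all be nulls
            survivors = k
    return m - survivors          # at least this many are true discoveries

e_values = [0.5] * 8 + [300.0, 300.0]
print(true_discovery_lower_bound(e_values))  # → 2
```

Note the closed-testing flavor: the two huge e-values only count as guaranteed discoveries because no subset containing them survives the local test, and that is certified by checking m sorted prefixes rather than 2^m subsets.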
5. The Real-World Test: The Brain Scan
The author tested this on real data from a brain imaging experiment (fMRI).
- The Setup: They scanned brains while people did a word-matching game. They wanted to know: "Which parts of the brain are actually lighting up?"
- The Result: They simulated stopping the experiment at different times (after 15 people, 30 people, 53 people).
- The Outcome: At every single stop point, the method gave a valid lower confidence bound on the amount of truly active brain tissue.
- Example: "After scanning 30 people, we are 80% sure that at least 400 voxels (tiny 3D pixels of the scan) in the 'Language Center' are active."
- As they scanned more people, this number grew, giving them more confidence.
- Crucially, they could stop early if the number was high enough, or keep going if they wanted more precision, without ever invalidating the previous results.
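The interim-look idea can be mimicked with a toy simulation. All numbers and the coin model below are assumptions of this sketch, not the paper's data or method: each "brain region" gets its own e-process, and at any look we count the regions whose score has crossed m/α. By Ville's inequality plus a union bound, with probability at least 1 − α no inactive region ever crosses, so the count is a valid lower bound on the number of truly active regions at every possible stopping point.

```python
import random

def interim_bound(data_per_region, alpha=0.05, p_alt=0.7):
    """Count regions whose e-process (likelihood-ratio product against a
    fair coin) has crossed m/alpha. Each inactive region has at most
    alpha/m chance of EVER crossing, so with probability >= 1 - alpha the
    count lower-bounds the number of truly active regions -- no matter
    when you stop and look."""
    m = len(data_per_region)
    bound = 0
    for observations in data_per_region:
        e = 1.0
        for heads in observations:
            e *= (p_alt if heads else 1 - p_alt) / 0.5
        if e >= m / alpha:
            bound += 1
    return bound

random.seed(3)
# 20 toy "regions": 5 truly active (biased coin), 15 inactive (fair coin)
active = [True] * 5 + [False] * 15
streams = [[random.random() < (0.7 if a else 0.5) for _ in range(500)]
           for a in active]

for n in (15, 30, 53, 500):  # interim looks, echoing the paper's 15/30/53
    b = interim_bound([s[:n] for s in streams])
    print(f"after {n} observations: at least {b} regions active")
```

The bound tends to start at zero and climb toward the true count of 5 as evidence accumulates, and stopping at any of the looks leaves it valid, which mirrors the stop-early-or-keep-going behavior described above.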
Why This Matters
- For Scientists: It stops the "waste" of collecting data you don't need.
- For Patients: In medical trials, if a new drug is clearly working, you can stop the trial early and get the drug to patients faster. If it's clearly failing, you can stop early to save lives and resources.
- For the General Public: It means scientific discoveries can be made faster, cheaper, and with a higher degree of certainty that the researchers didn't just get lucky by peeking at the data too early.
In a nutshell: This paper gives scientists a "safe pause button." They can collect data, check their results, stop, or continue, and the math guarantees that their conclusions about what is "real" and what is "noise" remain valid at every single step of the journey.