Here is an explanation of the paper "Confidence as Forecast," translated into simple language with everyday analogies.
The Big Problem: The "It's Either Yes or No" Confusion
Imagine you are a statistician. You run an experiment and calculate a Confidence Interval (CI). This is a range of numbers (like "between 10 and 20") that you think contains the true answer to your question.
For decades, the standard rule (from the famous statistician Jerzy Neyman) has been: "Once you see the numbers, stop talking about probability."
The logic goes like this: The true answer is a fixed number. Once you build your interval, it either does or does not contain that number. It's a binary fact. So, the probability is either 100% (it's in there) or 0% (it's not). Since you don't know which one it is, you aren't allowed to say, "I'm 95% sure." You just have to say, "I built this interval, and the method works 95% of the time in the long run."
The Problem: This feels unsatisfying. If you have to make a decision right now based on that specific interval, saying "It's either 0% or 100%" doesn't help you. It's like a weatherman saying, "It's either raining or it's not," without telling you if you need an umbrella.
The Author's Solution: Confidence as a "Weather Forecast"
Scott Lee, the author, suggests we change our perspective. Instead of treating a Confidence Interval as a final verdict, we should treat it as a probability forecast.
Think of it like this:
- The Old View: A CI is a locked box. Inside is either a "Win" or a "Loss." Once you open the box, the game is over.
- The New View: A CI is a weather forecast. You don't know if it will rain exactly at 2:00 PM, but you can predict the chance of rain based on the data you have.
Lee argues that even after you see the interval, you can still make a smart prediction about whether it covers the truth, using a tool called Proper Scoring Rules.
The Analogy: The "Brier Score" (The Penalty Game)
Imagine you are playing a game where you have to guess if a specific interval covers the truth.
- If you guess "100% sure it covers" and it doesn't, you get a huge penalty.
- If you guess "0% sure" and it does, you get a huge penalty.
- If you guess "50% sure" and it turns out to be a coin flip, you get a small penalty.
Lee proves mathematically that if you don't have any special extra information, the best possible guess to minimize your penalty is the nominal confidence level (e.g., 95% or 50%).
So, even after you see the interval, if you have no other clues, saying "There is a 95% chance this covers the truth" is the most honest, mathematically optimal thing you can say. It's not a "belief"; it's a forecast based on how the machine works.
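The penalty game can be sketched in a few lines of Python. This is my own toy illustration, not code from the paper: `expected_brier` is a hypothetical helper that computes the average Brier penalty for a forecast `q` when the interval covers the truth with probability `p`, and a brute-force search shows that the nominal 95% is the penalty-minimizing forecast.

```python
# Toy illustration (mine, not the paper's code). Brier penalty: (forecast - outcome)^2,
# where the outcome is 1 if the interval covers the truth and 0 if it misses.
def expected_brier(q: float, p: float) -> float:
    """Average penalty for forecasting q when coverage occurs with probability p."""
    return p * (q - 1) ** 2 + (1 - p) * q ** 2

p = 0.95  # a 95% CI procedure covers the truth in 95% of repetitions
candidates = [i / 100 for i in range(101)]  # forecasts 0.00, 0.01, ..., 1.00
best_q = min(candidates, key=lambda q: expected_brier(q, p))
print(best_q)  # 0.95: the nominal confidence level minimizes the expected penalty
```

Notice that confidently overshooting (forecasting 1.0) or hedging down to 0.5 both score worse on average than simply repeating the nominal 95%.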
The Twist: When You Can Update Your Forecast
Here is where it gets really interesting. Sometimes, the interval itself gives you extra clues that change the odds.
The "Lost Submarine" Analogy
Imagine a submarine is lost on the ocean floor. It releases two bubbles, and each one surfaces at a random spot somewhere within a 10-meter stretch centered on the sub.
- Scenario A: The bubbles are far apart (covering 9 meters of the hallway). Your interval is huge.
- Scenario B: The bubbles are right next to each other (covering only 1 meter). Your interval is tiny.
In both cases, the method is the same: take the interval between the two bubbles. Across many repetitions, that interval catches the sub exactly half the time, so the "Confidence Interval" method says, "We are 50% confident."
- The Old View: "It's 50% either way. Don't look at the size."
- The New View: "Wait a minute! If the bubbles are far apart, the interval is huge, so it's very likely to catch the sub. If the bubbles are close together, the interval is tiny, so it's very unlikely to catch the sub, even though the math says '50%'."
Lee shows that in these specific cases, you should update your forecast.
- If the interval is huge, you might say, "I'm 90% sure this covers it."
- If the interval is tiny, you might say, "I'm only 10% sure."
This isn't guessing; it's using the shape of the data to refine your prediction, just like a meteorologist looks at a radar map to refine a rain forecast.
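The bubbles story can be checked with a minimal Monte Carlo sketch (my own construction, not the paper's code). I assume each bubble surfaces uniformly at random within 5 meters of the sub, and the interval is the span between the two bubbles; that interval covers the sub 50% of the time overall. Under this particular toy model, a gap wider than 5 meters is guaranteed to cover, while a gap of roughly 1 meter covers only about 11% of the time; the exact conditional numbers depend on the model, so read the 90% and 10% figures above as illustrative.

```python
import random

random.seed(0)
TRUTH = 0.0      # the sub's true position (known to the simulation, not the analyst)
N = 400_000

n_all = hit_all = 0
n_wide = hit_wide = 0        # gaps wider than 5 m
n_narrow = hit_narrow = 0    # gaps of roughly 1 m

for _ in range(N):
    # Assumed model: each bubble surfaces uniformly within 5 m of the sub.
    a = random.uniform(TRUTH - 5, TRUTH + 5)
    b = random.uniform(TRUTH - 5, TRUTH + 5)
    lo, hi = min(a, b), max(a, b)
    covers = lo <= TRUTH <= hi
    width = hi - lo
    n_all += 1; hit_all += covers
    if width > 5:
        n_wide += 1; hit_wide += covers
    if 0.9 < width < 1.1:
        n_narrow += 1; hit_narrow += covers

cov_all = hit_all / n_all           # ~0.50: the unconditional 50% confidence level
cov_wide = hit_wide / n_wide        # 1.00: a gap over 5 m must straddle the sub
cov_narrow = hit_narrow / n_narrow  # ~0.11: a ~1 m gap rarely catches it
print(f"overall {cov_all:.2f}, wide {cov_wide:.2f}, narrow {cov_narrow:.2f}")
```

The simulation confirms the twist: the procedure is an honest 50% method overall, yet the width of a particular realized interval can push the coverage odds all the way to certainty or down near zero.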
The "Monty Hall" Connection
The paper uses a game show analogy (Monty Hall) to prove this point.
- In the game, you pick a door. The host opens a losing door.
- The "Neyman" view: "The door you picked either has the car or it doesn't. Probability is 0 or 1." (This leads you to stay and lose).
- The "Forecast" view: "Based on the rules of the game, switching doors gives me a 2/3 chance of winning." (This leads you to switch and win).
The author argues that treating confidence as a forecast allows us to make the "smart move" (switching doors) rather than getting stuck in the "0 or 1" trap.
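The switching claim is easy to verify by simulation. Below is a standard Monty Hall Monte Carlo (my sketch, not from the paper): staying wins about a third of the time, switching about two-thirds.

```python
import random

random.seed(1)
DOORS = [0, 1, 2]

def play(switch: bool) -> bool:
    """One round of Monty Hall: return True if the player ends up with the car."""
    car = random.choice(DOORS)
    pick = random.choice(DOORS)
    # The host opens a losing door that the player did not pick.
    opened = random.choice([d for d in DOORS if d != pick and d != car])
    if switch:
        pick = next(d for d in DOORS if d != pick and d != opened)
    return pick == car

N = 100_000
stay_rate = sum(play(False) for _ in range(N)) / N
switch_rate = sum(play(True) for _ in range(N)) / N
print(f"stay {stay_rate:.2f}, switch {switch_rate:.2f}")  # roughly 0.33 vs 0.67
```

The "0 or 1" framing is true but useless here; the forecast framing, which uses the rules the host follows, is what actually wins the car.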
Summary: What Should You Do?
The paper gives a simple guide for applied work:
- Default to the Number: If you see a 95% Confidence Interval and you don't know anything special about the situation, just treat it as a 95% forecast. It's the best guess you can make.
- Look for Clues: If the interval itself carries a clue (for example, it is unusually wide or unusually narrow in a way the underlying math says changes the coverage odds), update your forecast. Use the shape of the interval to adjust your percentage up or down.
- Forget the "Belief" Trap: You don't need to believe in "subjective feelings" or "Bayesian priors" to do this. You are just acting like a smart forecaster who knows how the machine works.
The Takeaway:
Confidence Intervals aren't just rigid boxes that are either right or wrong. They are predictions. Sometimes the prediction is a flat "95%," and sometimes the data tells you to adjust that number. By treating them as forecasts, we can make better decisions without breaking the rules of frequentist statistics.