Confidence intervals for the Poisson distribution

This paper addresses confusion among physicists regarding Poisson sampling results by comparing various techniques and recommending Garwood's confidence intervals as the most consistent and intuitive method for summarizing data.

Original author: Frank C. Porter

Published 2026-04-22

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery: How many times did a rare event happen?

In the world of physics (and many other sciences), events like radioactive decays or particle collisions happen randomly. Sometimes you see them 10 times, sometimes 2 times, and sometimes 0 times. This randomness follows a specific rule called the Poisson distribution.
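This randomness is easy to see in simulation. Here is a minimal sketch (using NumPy; the numbers and seed are my own illustration, not from the paper): repeat the "same" experiment ten times with a true rate of 3 and watch the counts scatter.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
true_rate = 3.0                      # the "true" rate of the process

# Ten independent runs of the same experiment, each counting events
# over one unit of exposure
counts = rng.poisson(true_rate, size=10)
print(counts)  # each run gives a different count, all from the same true rate
```

Even though the true rate never changes, individual counts of 0, 2, or 6 are all perfectly ordinary outcomes.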

The problem isn't counting the events; it's telling the story of what that count means. If you counted 3 events, does that mean the "true" rate of the universe is exactly 3? Probably not. It could be 2, or 4, or even 10, just because of random luck.

Scientists need a way to say: "Based on seeing 3 events, the true rate is likely somewhere between X and Y." This range is called a Confidence Interval.

For decades, physicists have been arguing about the best way to draw this range. Some methods are too wide (wasting data), some are too narrow (lying about precision), and some behave strangely when you look at them closely.

Frank Porter's paper is like a judge settling a courtroom dispute between all these different methods. He asks: "Which method tells the truth about the measurement without getting confused by what we think the truth should be?"

Here is the breakdown of his verdict, using simple analogies.


1. The Core Conflict: Description vs. Interpretation

Imagine you are describing a photo of a blurry face.

  • Description: "The photo shows a blurry blob that looks like it might be a dog." (This is what the paper focuses on: describing the data exactly as it is).
  • Interpretation: "I am 95% sure that the blurry blob is actually a dog." (This is what people often want to do, but it requires guessing about the "truth" before seeing the data).

The author argues that most confusion comes from mixing these up. We should first describe the measurement objectively (the blurry blob) before trying to guess the truth (the dog). If we try to force the description to fit our physical beliefs (e.g., "The rate can't be negative!"), we end up with confusing, broken math.

2. The Contenders (The Methods)

The paper reviews many different "rulers" scientists use to measure the uncertainty. Here are the main characters:

  • The "Garwood" Ruler (The Old Reliable):

    • How it works: It's a classic, conservative method. It draws a wide net to make sure it never misses the true value.
    • Pros: It's consistent. If you look at the same data with different levels of certainty, the ranges fit inside each other perfectly (like Russian nesting dolls). The math behaves smoothly.
    • Cons: It's sometimes a bit too wide (over-covering): a nominal 95% interval actually contains the true value more than 95% of the time, so it's wider than strictly necessary. It's "safe" but not "tight."
  • The "Crow & Gardner" Ruler (The Tightrope Walker):

    • How it works: It tries to make the net as small as possible to be more precise.
    • Pros: It's often shorter (more precise) than Garwood.
    • Cons: It's chaotic. If you change the confidence level slightly (from 90% to 95%), the range might jump wildly or even exclude the most likely answer. It's like a tightrope walker who falls off if the wind blows a little.
  • The "Feldman-Cousins" Ruler (The Physical Enforcer):

    • How it works: It forces the answer to stay in the "physical" zone (e.g., it won't allow a negative number of particles).
    • Pros: It feels intuitive to physicists who hate negative numbers.
    • Cons: When the data is weird (like seeing fewer events than the background noise), this ruler shrinks the range to almost zero. It tricks you into thinking you have super-precise knowledge when you actually have very little. It hides the fact that the background noise fluctuated wildly.
  • The "Bayesian" Ruler (The Believer):

    • How it works: It starts with a guess (a "prior") about what the answer might be, then updates it with data.
    • Pros: Great for making decisions.
    • Cons: It depends on your initial guess. If two people have different beliefs, they get different answers. The paper argues this is about belief, not description of the measurement.
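For the record, the "Garwood ruler" has a simple closed form: its endpoints are quantiles of a chi-squared distribution. Here is a minimal sketch in Python (SciPy assumed; the function name `garwood_interval` is mine, not from the paper):

```python
from scipy.stats import chi2

def garwood_interval(n, cl=0.95):
    """Garwood confidence interval for a Poisson mean, given n observed counts.

    The endpoints are chi-squared quantiles; this is the classic "exact"
    central interval, conservative because Poisson counts are discrete.
    """
    alpha = 1.0 - cl
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * n) if n > 0 else 0.0
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (n + 1))
    return lower, upper

# Seeing 3 events: the true rate is plausibly anywhere from ~0.6 to ~8.8
lo, hi = garwood_interval(3, cl=0.95)
print(f"95% interval for n = 3: ({lo:.3f}, {hi:.3f})")  # ≈ (0.619, 8.767)
```

Note how wide the range is for such a small count: that width is the honest answer, not a flaw in the method.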

3. The "Averaging" Trap

The paper also warns about a common mistake: Averaging results.
Imagine you have 10 different experiments, each with its own confidence interval. You might think, "I'll just average the middle numbers and the widths."

  • The Trap: If you do this with Poisson data, you can accidentally create a result that is less accurate than the individual parts. It's like averaging a bunch of blurry photos and expecting a sharp image. The math breaks down unless you go back to the raw data (the original counts) and re-calculate everything together.
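A toy illustration of the trap (my own numbers, not from the paper): two identical experiments see 2 and 4 events. Averaging the two Garwood intervals endpoint-by-endpoint gives a different (here, wider) range than the correct approach of pooling the raw counts and dividing by the total exposure.

```python
from scipy.stats import chi2

def garwood_interval(n, cl=0.95):
    """Garwood (chi-squared) confidence interval for a Poisson mean."""
    alpha = 1.0 - cl
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * n) if n > 0 else 0.0
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (n + 1))
    return lower, upper

n1, n2 = 2, 4  # counts from two identical experiments

# The trap: average the interval endpoints from each experiment
lo1, hi1 = garwood_interval(n1)
lo2, hi2 = garwood_interval(n2)
naive = ((lo1 + lo2) / 2, (hi1 + hi2) / 2)

# The right way: pool the raw counts, then divide by the total exposure (2)
lo, hi = garwood_interval(n1 + n2)
pooled = (lo / 2, hi / 2)

print(f"naive average of intervals: ({naive[0]:.3f}, {naive[1]:.3f})")
print(f"interval from pooled counts: ({pooled[0]:.3f}, {pooled[1]:.3f})")
```

The two answers disagree: the naive average does not have the coverage it claims, while the pooled-count interval does.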

4. The Verdict: Why Garwood Wins

After testing all these rulers against a list of "Desirable Properties" (like: Does it behave smoothly? Does it nest? Does it give sensible "p-values" which are like probability scores?), the author declares a winner:

The Garwood Interval.

Why?

  1. It's Honest: It describes the measurement without trying to force it into a "physical" box that distorts the math.
  2. It's Stable: If you tweak the confidence level, the answer changes smoothly. It doesn't jump around.
  3. It's Consistent: The ranges nest perfectly (a 90% range is always inside a 95% range).
  4. It Makes Sense: The "p-values" (probability scores) it generates are intuitive and continuous.
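The nesting property in point 3 is easy to verify numerically. A sketch (again using the chi-squared form of the Garwood interval, with SciPy assumed):

```python
from scipy.stats import chi2

def garwood_interval(n, cl=0.95):
    """Garwood (chi-squared) confidence interval for a Poisson mean."""
    alpha = 1.0 - cl
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * n) if n > 0 else 0.0
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (n + 1))
    return lower, upper

# For every count, the 90% interval sits inside the 95% interval
for n in range(50):
    lo90, hi90 = garwood_interval(n, cl=0.90)
    lo95, hi95 = garwood_interval(n, cl=0.95)
    assert lo95 <= lo90 and hi90 <= hi95, f"nesting failed at n={n}"

print("90% intervals nest inside 95% intervals for n = 0..49")
```

This is exactly the "Russian nesting dolls" behavior described earlier, and it is the property that the shortest-interval constructions (like Crow & Gardner's) give up.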

The Trade-off:
Yes, the Garwood interval is sometimes a little wider than necessary. But the author argues that it is better to be slightly too safe and consistent than to be precise but chaotic. A method that jumps around or gives weird answers when you look at it from a slightly different angle is dangerous for science.

Summary in One Sentence

When counting rare, random events, don't try to force the math to fit your physical beliefs; instead, use the Garwood method because it provides a stable, consistent, and honest description of the data, even if it's a little bit wider than the other options.

The Takeaway for Everyday Life:
When you are unsure about something, it's better to have a wide, reliable estimate that doesn't change when you look at it from a different angle, than a narrow, precise estimate that falls apart the moment you test it.
