Second order asymptotics for the number of times an estimator is more than epsilon from its target value

This paper investigates second-order asymptotics for the number of times a strongly consistent estimator deviates from its target by more than $\varepsilon$, introducing a concept of "asymptotic relative deficiency" to distinguish between estimators with identical first-order efficiency and demonstrating that specific finite-sample corrections (such as dividing by $n - 1/3$ for the normal variance) minimize the expected number of such errors.

Nils Lid Hjort, Grete Fenstad

Published Wed, 11 Ma

Imagine you are a coach training a team of runners (estimators) to find a hidden treasure (the true parameter, $\theta$) in a vast, foggy field.

Your goal isn't just to see who finds the treasure eventually (that's what standard statistics usually checks). Your goal is to count how many times each runner steps outside a small, safe circle of radius $\epsilon$ around the treasure before they finally settle down.

Let's call this count the "Miss Count" ($Q_\epsilon$).

The Problem: The Tie

In the world of statistics, we often have two runners who are equally good. They both eventually find the treasure, and if you look at their long-term average speed, they are identical. Standard statistics says, "Great, they are tied. Pick either one."

But the authors of this paper, Nils Lid Hjort and Grete Fenstad, ask: "Wait a minute. If they are tied in speed, who stumbles more often while running?"

They want to know: Between two equally fast runners, which one makes fewer mistakes (steps outside the safe circle) along the way?

The First Order vs. The Second Order

  • First Order (The Old Way): This looks at the runners' average speed. If Runner A and Runner B both average 10 mph, the old method says they are equal.
  • Second Order (The New Way): This looks at the friction. Even if they have the same average speed, maybe Runner A stumbles a lot but recovers quickly, while Runner B glides smoothly. The paper develops a new way to measure this "stumbling" to break the tie.

The Analogy: The "Miss Count"

Imagine the treasure is a bullseye.

  • $\epsilon$ (Epsilon): This is the size of the bullseye. It's very small.
  • $Q_\epsilon$: This is the total number of times a runner's foot lands outside that bullseye as they run their race (as the sample size $n$ grows).

The paper proves that if you shrink the bullseye ($\epsilon$) to be microscopic, the total number of misses ($Q_\epsilon$) becomes huge. However, if you multiply the misses by the size of the bullseye squared ($\epsilon^2 \times Q_\epsilon$), you get a stable number. This number tells you how "wobbly" the runner is.
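This stabilization is easy to see numerically. Below is a small Monte Carlo sketch (not from the paper; the setup is an assumption) using the simplest possible runner, the sample mean of standard normal data with true target 0. A classical first-order result of this kind says the scaled expected miss count $\epsilon^2 \times E[Q_\epsilon]$ settles near the variance $\sigma^2$.

```python
import numpy as np

# Monte Carlo sketch (illustrative, not the paper's derivation): for the
# sample mean of N(0,1) data, count how often |mean_n - 0| > eps over
# n = 1..max_n, then check that eps^2 * (average miss count) is stable.
rng = np.random.default_rng(42)
eps, max_n, reps = 0.3, 3000, 400

counts = []
for _ in range(reps):
    x = rng.standard_normal(max_n)
    means = np.cumsum(x) / np.arange(1, max_n + 1)  # running sample mean
    counts.append(np.sum(np.abs(means) > eps))      # miss count Q_eps
scaled = eps ** 2 * np.mean(counts)
print(f"eps^2 * E[Q_eps] is roughly {scaled:.2f}")  # near sigma^2 = 1
```

Shrinking `eps` further (and growing `max_n` to match) keeps the scaled value near 1 even as the raw miss count explodes, which is exactly the stabilization described above.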

The Big Discovery: The "Perfect" Denominator

The authors apply this "Miss Count" theory to some classic statistics problems. They found that the formulas we use in textbooks aren't always the best at minimizing these "misses."

Here are their surprising findings, translated into everyday terms:

1. The Variance Problem (Measuring Spread)
When calculating how spread out a set of numbers is (variance), we usually divide by $N$ (the total count) or $N-1$.

  • The Old Belief: $N-1$ is the "unbiased" choice. $N$ is the "maximum likelihood" choice.
  • The Paper's Verdict: Neither is the best at minimizing "misses."
  • The Winner: You should divide by $N - 1/3$.
    • Analogy: Imagine you are baking a cake. The recipe says "add 1 cup of flour." But if you want the cake to be perfectly stable (fewest errors), you actually need to add a tiny bit less than 1 cup. The paper says the "magic number" is $1/3$ of a cup less than the standard correction.
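To make the three divisors concrete, here is a minimal sketch (the sample size and data are illustrative, not from the paper) computing the same sum of squared deviations with each divisor. The miss-count-optimal divisor sits strictly between the other two.

```python
import numpy as np

# Illustrative sketch: the three competing divisors for the normal variance.
# The paper's claim: dividing by n - 1/3 minimizes the expected number of
# times the estimate lands more than eps away from the true variance.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(loc=0.0, scale=2.0, size=n)  # true variance = 4

ss = np.sum((x - x.mean()) ** 2)            # sum of squared deviations
var_ml = ss / n                             # maximum likelihood
var_unbiased = ss / (n - 1)                 # classic unbiased
var_miss_opt = ss / (n - 1 / 3)             # the paper's miss-count optimum

print(var_ml, var_miss_opt, var_unbiased)   # always ordered smallest to largest
```

Since the sum of squares is positive, the $n - 1/3$ estimate always lies between the maximum likelihood and unbiased versions; the paper's point is that this in-between compromise is the one that stumbles least often.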

2. The Exponential Mean
When measuring the average time until an event happens (like a lightbulb burning out):

  • The Winner: A specific adjustment where you divide by $N + 1/3$ (conceptually).
  • The Result: The standard "Maximum Likelihood" method (which divides by $N$) actually makes $1/9$ more errors than the optimized method.
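A tiny sketch of that adjustment (the scale and sample size are made up for illustration):

```python
import numpy as np

# Illustrative sketch: estimating the mean waiting time of an exponential.
# Maximum likelihood divides the total by n; the miss-count-optimal
# adjustment quoted above divides by n + 1/3 instead.
rng = np.random.default_rng(1)
n = 40
waits = rng.exponential(scale=5.0, size=n)  # true mean waiting time = 5

mean_ml = waits.sum() / n                   # maximum likelihood
mean_adj = waits.sum() / (n + 1 / 3)        # n + 1/3 adjustment
print(mean_ml, mean_adj)                    # adjusted is always slightly smaller
```

The adjustment shrinks the estimate by a hair, and per the paper that hair is worth roughly a $1/9$ reduction in long-run misses.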

3. The Squared Mean
When estimating the square of an average (like estimating the power of a signal):

  • The Winner: A specific adjustment where you add a small correction term.
  • The Result: The standard method underestimates the error, while the optimized method makes the fewest "misses."
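The text does not spell out the exact correction, so as an illustration only, here is the classical bias correction for a squared mean: subtract the estimated variance of the sample mean, $s^2/n$, from the squared sample mean. The paper's miss-count-optimal correction is of this general form but its constant may differ.

```python
import numpy as np

# Illustrative sketch: estimating mu^2 (e.g. signal power). The naive
# plug-in estimate xbar^2 is biased upward by sigma^2 / n; subtracting
# s^2 / n is the classical correction. The paper's miss-count-optimal
# correction term may use a different constant.
rng = np.random.default_rng(2)
n = 30
x = rng.normal(loc=3.0, scale=1.0, size=n)  # true mu^2 = 9

xbar = x.mean()
s2 = x.var(ddof=1)                          # sample variance
naive = xbar ** 2                           # plug-in estimate
corrected = xbar ** 2 - s2 / n              # classical bias correction
print(naive, corrected)
```

The corrected estimate is always a little smaller than the naive one, illustrating the "small correction term" the winner adds.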

Why Does This Matter?

You might ask, "Who cares about $1/3$ of a denominator? It's a tiny difference!"

The authors argue that in the real world, we often have to choose between two methods that look identical on paper. This "Second Order" analysis is the tie-breaker. It tells us:

  • "Method A and Method B are both good."
  • "But Method A will make you step outside the safety zone slightly more often than Method B."
  • "Therefore, if you want the smoothest ride with the fewest stumbles, pick Method B."

The "Brownian Motion" Connection

The paper gets a bit technical at the end, mentioning "Brownian motion" (the random jitter of particles in a fluid).

  • The Metaphor: Imagine the runners aren't just running on a track, but are actually tiny particles jittering in a fluid. The "Miss Count" is related to how much time these particles spend touching the walls of their container.
  • The authors show that the difference between two estimators behaves like the difference in time two jittery particles spend near the walls. This connects their statistical findings to deep physics-like laws of randomness.

Summary

This paper is about fine-tuning.
Just as a master chef knows that a pinch of salt makes a dish perfect, while a standard recipe might be "good enough," these statisticians found the "pinch of salt" (the $-1/3$ adjustment) that makes statistical estimators make the fewest possible mistakes as they get more data.

They didn't just find a new way to run the race; they found the exact stride length that prevents you from tripping over your own feet.