Imagine you are trying to hit a bullseye on a dartboard, but the board is moving slightly, and your aim gets better the more darts you throw. You want to know two specific things:
- The "Last Miss": After how many throws will you never miss the bullseye again?
- The "Total Misses": How many times in total will you miss the bullseye before you finally get good enough to hit it every single time?
This paper, written by statisticians Nils Lid Hjort and Grete Fenstad, is all about answering these questions for mathematical estimators (which are just fancy ways of guessing a true value based on data).
Here is the breakdown of their findings using simple analogies.
1. The Core Concept: The "Last Miss" and "Total Misses"
In statistics, we often use data to guess a true number (like the average height of all people). As we collect more data (more darts), our guess gets closer to the truth.
- Strong Consistency: This just means that if you keep throwing darts forever, you will eventually hit the bullseye and stay there.
- The Problem: We know we will eventually hit the bullseye, but we don't know when the last time we miss will be, or how many times we will miss in total.
The authors ask: If we define a "miss" as being more than a tiny distance (ε) away from the truth, what is the distribution of the last time we miss, and the total count of misses?
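To make these two quantities concrete, here is a minimal Monte Carlo sketch (my own illustration, not the paper's notation), using the running sample mean of normal data as the estimator; `last_miss_and_total` is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def last_miss_and_total(eps, n_max=20_000, mu=0.0, sigma=1.0):
    """For one simulated data stream, track the running sample mean and
    record (a) the last n at which it was more than eps from mu (the
    'last miss') and (b) how many n it was that far off ('total misses')."""
    x = rng.normal(mu, sigma, size=n_max)
    running_mean = np.cumsum(x) / np.arange(1, n_max + 1)
    miss = np.abs(running_mean - mu) > eps
    last = int(np.flatnonzero(miss)[-1]) + 1 if miss.any() else 0
    return last, int(miss.sum())

last, total = last_miss_and_total(eps=0.1)
```

With eps = 0.1 the running mean typically stops missing after a few hundred observations; both quantities are finite random variables, and their distributions are exactly what the paper studies.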
2. The Big Discovery: The "Brownian Motion" Dance
The authors found a surprising pattern. When you zoom in on these "misses" as the allowed error gets smaller and smaller, the behavior of the estimator looks like a specific type of random dance called Brownian Motion (think of a drunk person stumbling around a street).
They discovered that if you multiply the "Last Miss" time by the square of the allowed error, the result settles into a predictable limiting distribution as the error shrinks.
- The Analogy: Imagine you are timing how long a drunk person wanders outside a specific circle before they finally stay inside forever. The paper says that no matter how you are walking (as long as you are generally heading toward the center), the time you spend outside follows a specific mathematical rule based on the "drunkard's walk."
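This scaling can be checked in a toy simulation, again assuming normal data and the running mean as the estimator (my own sketch, not the authors' code): multiply the average "Last Miss" time by ε² and it hovers around a single constant as ε shrinks.

```python
import numpy as np

rng = np.random.default_rng(42)

def avg_last_miss(eps, n_max, reps=200):
    """Average 'last miss' time of the running mean of N(0,1) data,
    over `reps` independent data streams."""
    x = rng.normal(size=(reps, n_max))
    means = np.cumsum(x, axis=1) / np.arange(1, n_max + 1)
    miss = np.abs(means) > eps
    # Last (1-based) index at which each path missed; 0 if it never missed.
    last = n_max - np.argmax(miss[:, ::-1], axis=1)
    last[~miss.any(axis=1)] = 0
    return last.mean()

# eps**2 * (average last-miss time) should be roughly constant as eps shrinks.
scaled = {eps: eps**2 * avg_last_miss(eps, n_max=10_000) for eps in (0.2, 0.1, 0.05)}
```

The three scaled values land near one another even though the raw last-miss times differ by more than an order of magnitude, which is the "Brownian motion" scaling in action.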
3. The "Gold Standard": Maximum Likelihood Estimators
In the world of statistics, there is a "gold standard" way to guess values called the Maximum Likelihood Estimator (MLE). It's the most popular method because it usually gives the best guess.
The paper proves something very cool: The MLE is the fastest runner.
- The Analogy: Imagine a race where runners are trying to stay inside a shrinking tunnel. The MLE is the runner who, statistically speaking, settles inside the tunnel for good sooner than anyone else.
- The Result: No other method of guessing can guarantee that you will stop making "big mistakes" faster than the MLE. If you use a different method, you might get lucky sometimes, but on average, you will keep missing the target longer.
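A hedged illustration of this race: the sample mean (which is the MLE for a normal mean) against a deliberately wasteful competitor that only averages the first half of the data. The setup and constants here are mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(7)
n_max, reps, eps = 5_000, 500, 0.15
ns = np.arange(1, n_max + 1)

x = rng.normal(size=(reps, n_max))
csum = np.cumsum(x, axis=1)

# MLE for a normal mean: the full sample mean.
mle = csum / ns
# A wasteful competitor: the mean of only the first half of the data.
half_n = (ns + 1) // 2
wasteful = csum[:, half_n - 1] / half_n

def avg_last_miss(est):
    """Average 'last miss' time of an estimator path array (reps x n_max)."""
    miss = np.abs(est) > eps
    last = n_max - np.argmax(miss[:, ::-1], axis=1)
    last[~miss.any(axis=1)] = 0
    return last.mean()

avg_mle, avg_wasteful = avg_last_miss(mle), avg_last_miss(wasteful)
```

In a run like this the wasteful estimator's average last-miss time comes out noticeably larger than the MLE's, matching the intuition that less efficient estimators keep making big mistakes for longer.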
4. Different Scenarios, Different Rules
The paper isn't just about simple averages; it looks at complex situations:
The "Empirical Distribution" (The Glivenko–Cantelli Theorem):
Imagine you are trying to draw a map of a city based on random sightings. The paper looks at the last time your map looks significantly different from the real city. They found that the "last miss" for this map-drawing process follows a specific, complex pattern involving a "Kiefer process" (a two-dimensional version of the drunkard's walk). They proved that the standard way of drawing this map is actually the best possible way to stop making big errors.

Density Estimation (Smoothing Data):
Imagine you have a pile of sand and you want to guess the shape of the hill underneath. You use a "kernel" (a smoothing tool) to smooth out the sand.
- The Twist: In this specific case, the "Last Miss" doesn't follow the standard rule. It follows a different power law.
- The Surprise: The paper calculates that the "best" smoothing tool isn't the one everyone traditionally uses. It suggests tweaking the tool by a tiny amount (multiplying by 1.008) to minimize the total number of misses. It's like finding that your favorite recipe needs exactly 1.008 cups of flour instead of 1 cup to be perfect.
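Here is a sketch of the smoothing step in plain NumPy. Silverman's rule-of-thumb bandwidth and the way the 1.008 factor is applied are my own illustrative choices, not the paper's exact prescription.

```python
import numpy as np

def kde(grid, data, h):
    """Gaussian-kernel density estimate evaluated on `grid` with bandwidth h."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.normal(size=500)

# Silverman's rule-of-thumb bandwidth, nudged by the small multiplicative
# tweak mentioned in the text (1.008); both choices are illustrative here.
h = 1.008 * 1.06 * data.std() * len(data) ** (-0.2)
grid = np.linspace(-4, 4, 161)
f_hat = kde(grid, data, h)
```

The point of the tweak is not visual (a 0.8% change in bandwidth is invisible on a plot) but that it minimises the expected number of times the smoothed curve is ever an ε-miss from the true density.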
5. Why Does This Matter?
You might ask, "Who cares about the last time I miss?"
- Comparing Tools: It gives statisticians a new, fair way to compare two different guessing methods. Instead of just looking at the average error, you can ask: "Which method stops making big mistakes sooner?"
- Sequential Testing: It helps in designing experiments where you stop collecting data as soon as you are confident enough. The paper shows how to build "confidence sets" (safe zones) that shrink over time and, with probability one, eventually contain the true value.
- Power 1 Tests: It helps create tests that are guaranteed to detect a problem if one exists, eventually.
Summary
This paper is a deep dive into the "end game" of statistical estimation. It moves beyond asking "How accurate is the average guess?" to asking "How long do we have to wait until we are never wrong again?"
The main takeaways are:
- The time until you stop making big mistakes follows a predictable pattern based on random walks.
- The standard "Maximum Likelihood" method is the champion of speed; it stops making mistakes faster than any other method.
- For specific complex problems (like smoothing data), the "best" settings are slightly different from what people usually think, and the authors found the exact numbers.
It turns the abstract concept of "convergence" (getting closer and closer) into a concrete story about counting misses and timing the final victory.