Imagine you are a coach trying to decide which of two star players is better at predicting the weather for next week. Player A uses a high-tech satellite model, and Player B uses a simple rule of thumb: "It will be the same as today."
You watch them make predictions for a month. To see who wins, you count up their mistakes. But here's the catch: How do you decide if Player A is truly better, or if they just got lucky?
This is the problem the paper behind ForeComp tackles. It introduces a new toolkit (an R package) to help economists and data scientists compare forecasts without getting fooled by bad math.
Here is the breakdown in simple terms, using some everyday analogies.
1. The Old Way: The "Ruler" That Was Too Short
For decades, the standard way to compare forecasts was the Diebold-Mariano (DM) test. Think of this test like a ruler used to measure a long, winding river.
- The Problem: The old ruler assumed the river was straight and short. It measured its uncertainty using only the immediate past (a small handful of recent "echoes"), ignoring everything further back.
- The Reality: In the real world, weather (and economies) are messy. Mistakes often "echo." If you make a mistake today, you might make a similar mistake tomorrow. This is called serial correlation.
- The Consequence: Because the old ruler was too short, it often said, "Look! Player A is way better!" when they were actually just as good as Player B. It was over-confident. In statistics, this is called a "size distortion": a test that is supposed to raise a false alarm only 5% of the time ends up doing so far more often.
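The mechanics of that "ruler" fit in a few lines. Below is a toy Python sketch of the classic Diebold-Mariano statistic (my illustration, not ForeComp's R code): average the difference in squared errors, then divide by a long-run variance estimate built from only a few autocovariance "echoes."

```python
# Toy Diebold-Mariano statistic (illustrative sketch, not the ForeComp package).
import numpy as np

def dm_stat(e1, e2, M=None):
    """DM statistic for equal accuracy under squared-error loss.

    e1, e2 : forecast-error series of the two competing forecasters.
    M      : number of autocovariance lags ("echoes") used in the variance
             estimate; the classic choice grows only slowly with the sample.
    """
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2     # loss differential
    T = d.size
    if M is None:
        M = int(T ** (1 / 3))                         # the "short ruler"
    dc = d - d.mean()
    lrv = dc @ dc / T                                 # variance at lag 0
    for k in range(1, M + 1):                         # Bartlett-weighted echoes
        lrv += 2 * (1 - k / (M + 1)) * (dc[k:] @ dc[:-k]) / T
    return d.mean() / np.sqrt(lrv / T)

rng = np.random.default_rng(0)
e_a, e_b = rng.standard_normal((2, 200))              # two equally good players
print(round(dm_stat(e_a, e_b), 2))                    # roughly N(0, 1) under the null
```

The classic recipe then compares this statistic to the familiar ±1.96 normal cutoff, which is exactly where the trouble starts when the echoes are strong.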
2. The New Solution: The "Flexible Tape Measure"
The ForeComp package introduces Fixed-Smoothing Asymptotics.
- The Analogy: Imagine instead of a rigid ruler, you have a flexible measuring tape that stretches to fit the whole river, no matter how winding it is.
- How it works: Instead of ignoring the "echoes" of past mistakes, this new method acknowledges them. It keeps the smoothing "window" at a fixed fraction of the sample and uses critical values that account for that choice, rather than pretending the window is negligibly small.
- The Result: It's more honest. It says, "Well, the data is messy, so I'm not quite as sure that Player A is better." This prevents the test from crying "Wolf!" when there is no wolf.
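In spirit, the fixed-smoothing approach swaps the textbook ±1.96 cutoff for critical values from the non-normal distribution that a wide, fixed-fraction window produces. This Python sketch (again my illustration, not the package's code) simulates such a critical value under the null:

```python
# Fixed-smoothing idea, sketched: hold the bandwidth at a fixed fraction b of
# the sample, then get critical values by simulating the null distribution.
import numpy as np

def t_hac(x, b):
    """t-statistic for mean zero, Bartlett long-run variance with M = b*T lags."""
    x = np.asarray(x, float)
    T = x.size
    M = max(1, int(b * T))
    xc = x - x.mean()
    lrv = xc @ xc / T
    for k in range(1, M + 1):
        lrv += 2 * (1 - k / (M + 1)) * (xc[k:] @ xc[:-k]) / T
    return x.mean() / np.sqrt(lrv / T)

rng = np.random.default_rng(1)
b = 0.5                                    # a "large" fixed bandwidth fraction
stats = [t_hac(rng.standard_normal(200), b) for _ in range(2000)]
cv = np.quantile(np.abs(stats), 0.95)      # simulated fixed-b critical value
print(round(cv, 2))                        # comfortably above the textbook 1.96
```

The wider the window, the fatter the tails of the null distribution, so the honest cutoff sits well beyond 1.96. That is the price of acknowledging the echoes.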
3. The "Tradeoff" Dashboard (Plot Tradeoff)
One of the coolest features of this package is a visual tool called Plot Tradeoff.
- The Metaphor: Imagine you are tuning a radio.
- If you turn the dial too far left (too little data), the signal is fuzzy, and you might hear static (false alarms).
- If you turn it too far right (too much data), the signal is clear, but you might miss a faint, new station (you miss a real discovery).
- What the Tool Does: The Plot Tradeoff draws a map for you. It shows you exactly what happens if you change the "dial" (the bandwidth).
- Red X: "If you use this setting, you will reject the null (say someone is better)."
- Red Circle: "If you use this setting, you won't reject."
- Green Dot: The "sweet spot" recommended by the authors.
- Why it matters: It stops you from cherry-picking a setting just to get the result you want. It shows you if your conclusion is robust (holds up no matter how you tune the dial) or fragile (falls apart if you tweak it slightly).
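Stripped of the graphics, the underlying sweep is simple. Here is a bare-bones Python illustration of the idea (not ForeComp's actual plotting function): try many settings of the "dial" and record the verdict at each one.

```python
# Bandwidth "dial" sweep, sketched: does the verdict survive as we retune?
import numpy as np

def dm_from_diff(d, M):
    """DM statistic computed directly from a loss-differential series d."""
    dc = d - d.mean()
    T = d.size
    lrv = dc @ dc / T
    for k in range(1, M + 1):
        lrv += 2 * (1 - k / (M + 1)) * (dc[k:] @ dc[:-k]) / T
    return d.mean() / np.sqrt(lrv / T)

rng = np.random.default_rng(2)
# A loss differential with some persistence ("echoing" mistakes).
e = rng.standard_normal(120)
d = 0.3 + np.convolve(e, [1.0, 0.7, 0.5], mode="same")

verdicts = []
for M in (1, 4, 8, 16, 32):                # turn the dial
    verdicts.append("reject" if abs(dm_from_diff(d, M)) > 1.96 else "don't reject")
    print(f"bandwidth {M:2d}: {verdicts[-1]}")
```

A conclusion that stays the same across the whole sweep is robust; one that flips as soon as you nudge the dial is exactly the fragility the tradeoff plot is designed to expose.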
4. The "Reality Check" (The Experiments)
The authors didn't just write theory; they ran thousands of computer simulations (like a video game where they played the game 5,000 times with different rules).
- The Finding: The old "short ruler" method (Standard DM) was over-rejecting the null hypothesis. It was declaring winners far too often in small samples.
- The Winner: The new "flexible tape" methods (Fixed-Smoothing) kept the false-alarm rate close to the nominal level (5%). Crucially, they didn't lose their ability to find real winners. They were honest but still sharp.
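A miniature version of that experiment (toy numbers and design of my own, not the authors' exact setup) can be run in a few lines: simulate two equally good forecasters whose loss differences "echo," and count how often the short-bandwidth test cries wolf at the 5% level.

```python
# Toy Monte Carlo: size of the short-bandwidth DM test under the null.
import numpy as np

def dm_from_diff(d, M):
    """DM statistic computed directly from a loss-differential series d."""
    dc = d - d.mean()
    T = d.size
    lrv = dc @ dc / T
    for k in range(1, M + 1):
        lrv += 2 * (1 - k / (M + 1)) * (dc[k:] @ dc[:-k]) / T
    return d.mean() / np.sqrt(lrv / T)

rng = np.random.default_rng(3)
T, rho = 100, 0.8            # short sample, strongly "echoing" mistakes
rejections = 0
for _ in range(1000):
    d = np.empty(T)          # AR(1) loss differential with mean zero: the null
    d[0] = rng.standard_normal()
    for t in range(1, T):
        d[t] = rho * d[t - 1] + rng.standard_normal()
    if abs(dm_from_diff(d, M=int(T ** (1 / 3)))) > 1.96:
        rejections += 1

rate = rejections / 1000
print(rate)                  # far above the nominal 0.05
```

With persistent echoes and a short ruler, the false-alarm rate lands well above 5%, which is precisely the over-rejection the simulations in the paper document at much larger scale.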
5. Real-World Application: The Weather Forecasters
The authors tested this on real data from the Survey of Professional Forecasters (a group of economists who predict the US economy).
- Scenario: They compared the experts' predictions against a "no-change" guess (assuming the economy stays the same).
- Result: Sometimes the old method said, "The experts are amazing!" The new method said, "Actually, they aren't significantly better than a simple guess."
- Lesson: In small datasets (like looking at just the last 10 years of data), the old method was lying to us. The new method gave a more reliable answer.
Summary
ForeComp is like a new, smarter referee for the game of prediction.
- Old Referee: Blows the whistle too easily, thinking every small advantage is a win.
- New Referee (ForeComp): Waits to see if the advantage is real, accounting for the "noise" and "echoes" in the data. It also gives you a dashboard to see if the call holds up under pressure.
If you are trying to decide if your new AI model is actually better than the old one, or if your financial strategy beats the market, this paper tells you: Don't trust the old math. Use the new toolkit to avoid fooling yourself.