Imagine you are a detective trying to solve a mystery: Does a specific treatment (like a new teaching method) actually change student behavior?
To solve this, you gather data from many students. But here's the catch: students aren't isolated islands. They sit in classrooms, attend schools, and live in neighborhoods. If you treat every student as an independent piece of evidence, you might be fooled. Students in the same classroom often influence each other; they share the same teacher, the same lunch, and the same mood. In statistics, we call these groups "clusters."
For decades, economists and social scientists have used a tool called "Cluster-Robust Inference" to handle this. It's like putting a safety net under your conclusions so you don't fall if the data is "clumpy."
However, James MacKinnon's paper argues that not all safety nets are created equal. Some are made of strong steel; others of wet tissue paper that rips the moment you step on it. The paper asks: how do we know which net to trust?
Here is the breakdown of the paper in simple terms, using some everyday analogies.
1. The Problem: The "Fake Confidence" Trap
Imagine you are betting on a coin flip. If you flip a coin 10 times, you might get 7 heads. You might think, "Wow, this coin is biased!" But with only 10 flips, 7 heads is well within the range of pure luck.

In statistics, when we have few clusters (say, only 12 classrooms), our standard math tools often act like they have flipped the coin 1,000 times. They give us false confidence: a test that is supposed to raise a false alarm only 5% of the time can end up raising one 10%, 15%, or more of the time.
The paper explains that the old, standard way of calculating these numbers (a formula called CV1, the default cluster-robust standard error in most software) is like using a ruler that shrinks when it gets hot. It makes your "margin of error" look tiny, leading you to make bold claims that might be wrong.
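To see the shrinking ruler in action, here is a minimal simulation sketch (in Python with numpy and statsmodels; the setup and numbers are illustrative, not taken from the paper). It builds a world with only 12 classrooms where the treatment truly does nothing, then counts how often a standard cluster-robust test declares "significant!" at the 5% level:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
G, n, reps, reject = 12, 50, 1000, 0            # 12 classrooms of 50 students

for _ in range(reps):
    x = np.repeat(rng.normal(size=G), n)        # "treatment" varies only across classrooms
    mood = np.repeat(rng.normal(size=G), n)     # shared classroom effect
    y = mood + rng.normal(size=G * n)           # the true effect of x is exactly zero
    groups = np.repeat(np.arange(G), n)
    fit = sm.OLS(y, sm.add_constant(x)).fit(
        cov_type="cluster", cov_kwds={"groups": groups})
    reject += fit.pvalues[1] < 0.05

print(f"False-alarm rate at nominal 5%: {reject / reps:.1%}")
```

Any false-alarm rate noticeably above 5% is the fake-confidence trap at work.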
2. The Better Tools: Stronger Safety Nets
MacKinnon suggests we stop using the shrinking ruler and switch to better tools. He highlights two main upgrades:
- The "Jackknife" (CV3): Imagine you are trying to guess the weight of a giant pumpkin. Instead of weighing the whole thing at once, you take a slice out, weigh the rest, then take another slice out, and weigh that. You do this for every slice. By seeing how much the weight changes when you remove one piece, you get a much more honest estimate of the total weight.
- In the paper: This method removes one classroom (cluster) at a time to see how much the results change. It usually gives a "wider" (more cautious) margin of error, which is safer. (A minimal code sketch appears after this list.)
- The "Wild Cluster Bootstrap" (WCR-S): Imagine you are trying to predict the weather. Instead of just looking at today's temperature, you simulate 1,000 different "what-if" weather scenarios based on today's data to see how often it rains.
- In the paper: This method runs thousands of computer simulations to see how the results would look if the data were slightly different. It's a heavy-duty computer check that often catches errors the other methods miss. (Also sketched in code after this list.)
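To make the pumpkin-slicing idea concrete, here is a minimal sketch of a cluster jackknife (CV3-style) standard error. This is a simplified version of the general recipe, not the exact estimator from the paper, and the function name is ours:

```python
import numpy as np

def jackknife_cluster_se(y, X, groups):
    """CV3-style sketch: re-run OLS leaving out one whole cluster at a
    time, then measure how much the coefficients jump around."""
    labels = np.unique(groups)
    G = len(labels)
    betas = np.array([
        np.linalg.lstsq(X[groups != g], y[groups != g], rcond=None)[0]
        for g in labels])                        # one estimate per omitted cluster
    dev = betas - betas.mean(axis=0)             # spread of the leave-one-out estimates
    V = (G - 1) / G * dev.T @ dev                # jackknife variance formula
    return np.sqrt(np.diag(V))                   # wider, more cautious standard errors
```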
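And here is a bare-bones sketch of the restricted wild cluster bootstrap for testing whether one coefficient is zero. Again this is simplified for illustration (in practice you would use a dedicated package, and details such as the weight distribution vary):

```python
import numpy as np
import statsmodels.api as sm

def wild_cluster_pvalue(y, X, groups, k, B=999, seed=0):
    """Restricted wild cluster bootstrap sketch for H0: beta_k = 0.
    Each replication flips the sign of every cluster's residuals at
    once (Rademacher weights), building one "what-if" dataset."""
    rng = np.random.default_rng(seed)
    labels = np.unique(groups)
    idx = np.searchsorted(labels, groups)        # map each row to its cluster
    t_hat = sm.OLS(y, X).fit(
        cov_type="cluster", cov_kwds={"groups": groups}).tvalues[k]
    null_fit = sm.OLS(y, np.delete(X, k, axis=1)).fit()   # impose beta_k = 0
    t_star = np.empty(B)
    for b in range(B):
        signs = rng.choice([-1.0, 1.0], size=len(labels))[idx]
        y_b = null_fit.fittedvalues + signs * null_fit.resid
        t_star[b] = sm.OLS(y_b, X).fit(
            cov_type="cluster", cov_kwds={"groups": groups}).tvalues[k]
    return np.mean(np.abs(t_star) >= np.abs(t_hat))       # two-sided p-value
```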
3. The "Red Flags": When to Stop and Think
The paper warns that even the best tools can fail if the data is "weird." MacKinnon suggests checking for Red Flags before you trust any result:
- The "Giant Cluster" Problem: Imagine you have 19 small groups of 50 people and 1 giant group of 10,000 people. If that giant group has a weird result, it will drag your whole conclusion off the rails. If your data has one massive cluster, be very skeptical.
- The "One-Sided" Problem: If you are testing a new drug, and you only have 1 hospital giving the drug and 11 hospitals giving a placebo, your math is broken. You need a balance. If you have too few "treated" groups, no method can save you.
- The "Heterogeneity" Check: Are all your groups basically the same? If one group is rich, one is poor, one is urban, and one is rural, they are too different to be compared easily. This "clumpiness" makes the math unreliable.
4. The Detective's Toolkit: How to Verify Your Results
So, how do you know which result to trust when you have a messy dataset? MacKinnon suggests a "Triangulation" approach. Don't just pick one number; run a few different tests:
- The "Placebo" Test: Imagine you pretend that a variable that shouldn't matter (like the color of the students' shoes) is actually the treatment. Run your analysis on this fake data. If your method says, "Wow, shoe color definitely changes grades!" then your method is broken. It's finding patterns where there are none.
- The "Targeted Simulation": Build a fake world that looks exactly like your real data, but where you know the answer is "No effect." Run your tests on this fake world. If your test says "Yes, there is an effect," then your method is lying to you.
5. The Real-World Examples
The paper tests these ideas on two real studies:
- Female Role Models: A study on whether seeing successful women in class makes girls want to study economics. The data had very few classes (clusters). The old method said, "Definitely yes!" The new, cautious methods said, "Maybe, but we aren't 100% sure." The simulations showed the old method was likely too confident.
- Poor Classmates in Delhi: A study on whether having poor students in a class changes volunteering habits. Here, the researchers had to decide: Do we group by school or by school-grade? The paper shows that grouping by school (fewer, bigger groups) was actually more reliable than grouping by school-grade (more, smaller groups), even though that seems counterintuitive.
The Bottom Line
Trust, but verify.
When you see a study claiming a "statistically significant" result based on clustered data:
- Check the number of groups: If there are fewer than 20–30 groups, be skeptical.
- Check the balance: Are there enough treated groups and control groups?
- Look for the "Wild" methods: If the study uses the newer "Wild Cluster Bootstrap" or "Jackknife" methods, it's more likely to be trustworthy than if it uses the old standard errors.
- Look for the "Placebo" check: Did the authors test their method on fake data to prove it works?
In short, statistics isn't just about crunching numbers; it's about knowing when your calculator is lying to you. MacKinnon's paper gives us the tools to spot those lies and find the truth.