When correcting for regression to the mean is worse than no correction at all

This paper argues that common statistical corrections for regression to the mean are often flawed or impractical. Instead, the authors propose that researchers evaluate uncorrected data against a structural null expectation derived from measurement repeatability, avoiding systematic bias and inflated error rates.

Original authors: José F. Fontanari, Mauro Santos

Published 2026-03-03

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a coach trying to figure out if your training program works differently for different athletes. You measure their speed before training (x₁) and after training (x₂). You want to know: Do the slowest runners improve the most? (This is called "compensatory growth" or "catch-up".)

To answer this, you calculate the "change" (how much faster they got) and see if it relates to their starting speed.

Here is the problem: Your stopwatch is slightly shaky. Sometimes you record a runner as slower than they really are just because of a bad start or a gust of wind. This is Measurement Error.

Because of this shaky stopwatch, a statistical illusion called Regression to the Mean (RTM) happens. It's like a magic trick where the data pretends to show a relationship that isn't really there. If you pick the slowest runners (who likely had a bad day or a bad measurement), they will naturally look faster the next time you measure them, even if you do nothing. They aren't actually improving; they are just "regressing" back to their average speed.
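
To see the illusion in numbers, here is a minimal simulation (ours, not the paper's): every runner's true speed is perfectly stable, and the only thing that varies is the shaky stopwatch. The crude regression of change on the first measurement still comes out negative.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 200
true_speed = rng.normal(10.0, 1.0, n)   # each runner's stable true speed
noise_sd = 0.5                          # the "shaky stopwatch"

# Two measurements of the SAME underlying speed: nobody actually changes.
x1 = true_speed + rng.normal(0, noise_sd, n)
x2 = true_speed + rng.normal(0, noise_sd, n)

change = x2 - x1

# The "crude" slope: regress change on the first measurement.
slope, intercept = np.polyfit(x1, change, 1)
print(f"crude slope: {slope:.3f}")      # reliably negative (about -0.2 here)

# When the two sessions have similar spread, the illusion is close to
# the test-retest correlation minus one.
rho = np.corrcoef(x1, x2)[0, 1]
print(f"rho - 1:     {rho - 1:.3f}")    # closely matches the crude slope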

This paper argues that trying to "fix" this illusion with standard math tricks often makes things worse.

The Three Characters in This Story

The authors analyze three ways scientists try to solve this problem:

1. The "Crude" Slope (The Unadjusted View)

This is just looking at the raw data without doing anything fancy.

  • The Problem: Because of the shaky stopwatch, the data will almost always show a spurious negative slope: it will look like the slowest runners improved the most, even if they didn't. (The one-line derivation after this list shows why.)
  • The Analogy: Imagine looking at a reflection in a funhouse mirror. The image is distorted, but at least you know you're looking at a distorted image.
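
For readers who want the algebra behind that funhouse mirror, one line suffices (assuming, for simplicity, that both measurement sessions have the same variance):

$$
\text{slope} \;=\; \frac{\operatorname{Cov}(x_2 - x_1,\; x_1)}{\operatorname{Var}(x_1)} \;=\; \frac{\operatorname{Cov}(x_1, x_2) - \operatorname{Var}(x_1)}{\operatorname{Var}(x_1)} \;=\; \rho - 1,
$$

where ρ is the correlation between the two measurements. Any measurement noise pushes ρ below 1, so the crude slope is negative even when nothing biological is going on.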

2. The "Berry et al." Method (The Popular Fix)

This is the method most ecologists and biologists currently use. It tries to mathematically "subtract" the illusion from the data.

  • The Paper's Verdict: This is dangerous.
  • The Analogy: Imagine you have a blurry photo of a cat. The Berry method is like using a filter that tries to sharpen the image, but it doesn't know how blurry the photo is.
    • If the photo is only slightly blurry, the filter might sharpen it too much, turning the cat into a tiger (creating a fake biological discovery).
    • If the photo is very blurry, the filter might erase the cat entirely, making it look like there was no animal there at all (hiding a real discovery).
  • The Result: This method is unreliable. It often creates "fake" scientific findings or hides real ones, leading to wrong conclusions about how animals grow or age. (The simulation sketch below makes this concrete.)
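
To make that unreliability concrete, here is a simplified sketch in the spirit of the Berry et al. / Kelly–Price adjustment (our paraphrase; the paper's exact formulas are not reproduced here). The adjustment subtracts the change expected from regression to the mean, estimated via the observed test-retest correlation. The trouble is that this correlation mixes real biology with noise, so when a genuine effect is present the method over-corrects:

```python
import numpy as np

rng = np.random.default_rng(1)

def berry_style_slope(x1, x2):
    """Slope of change on x1 after a plug-in RTM adjustment (sketch).

    Subtracts the RTM-expected change, (rho_hat - 1) * (x1 - mean),
    where rho_hat is estimated from the same noisy data. Because
    rho_hat mixes real biology with measurement noise, the correction
    removes real signal along with the illusion.
    """
    rho_hat = np.corrcoef(x1, x2)[0, 1]
    adjusted = (x2 - x1) - (rho_hat - 1) * (x1 - x1.mean())
    return np.polyfit(x1, adjusted, 1)[0]

# Simulate GENUINE compensatory growth: true change really does
# depend on the true initial value, with slope beta = -0.4.
n, beta, noise_sd = 100_000, -0.4, 0.5
t1 = rng.normal(10, 1, n)
x1 = t1 + rng.normal(0, noise_sd, n)                     # shaky measurement 1
x2 = t1 + beta * (t1 - 10) + rng.normal(0, noise_sd, n)  # shaky measurement 2

crude = np.polyfit(x1, x2 - x1, 1)[0]
print(f"true slope    : {beta}")                            # -0.4
print(f"crude slope   : {crude:.2f}")                       # ~ -0.52 (exaggerated)
print(f"adjusted slope: {berry_style_slope(x1, x2):.2f}")   # ~ -0.21 (over-corrected)
```

Neither number is right: the crude slope exaggerates the real effect, and the plug-in adjustment overshoots in the other direction, hiding roughly half of it.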

3. The "Blomqvist" Method (The Perfect Fix)

This method is mathematically perfect. It can remove the illusion completely.

  • The Catch: To use it, you need to know exactly how shaky your stopwatch is (the measurement error).
  • The Analogy: This is like having a filter that can perfectly restore a blurry photo, but only if you know the exact model of the camera lens that caused the blur.
  • The Problem: In real life, we rarely know the exact "shakiness" of our measurements. And if you have a small group of athletes (a small sample size), this method gets very jittery and unstable. It's like trying to balance a pencil on its tip: theoretically possible, but in practice it falls over easily. (The sketch below shows the inversion, and how sensitive it is to the assumed error.)
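
Under the standard "classical error" model, Blomqvist's inversion can be written in a few lines (our paraphrase, assuming equal error variance in both sessions). The observed slope b of change on x₁ relates to the true slope β as b = β(1 − k) − k, where k is the fraction of measured variance that is noise, so β can be recovered, but only if k is known:

```python
import numpy as np

def blomqvist_corrected_slope(x1, change, error_var):
    """Invert the RTM distortion, given a KNOWN error variance (sketch).

    Classical error model: observed slope b = beta * (1 - k) - k,
    with k = error_var / Var(x1). Solving for beta gives the line
    below. The hard part in practice is knowing error_var.
    """
    b = np.polyfit(x1, change, 1)[0]
    k = error_var / x1.var()
    return (b + k) / (1 - k)

rng = np.random.default_rng(2)
n, beta, noise_sd = 100_000, -0.4, 0.5
t1 = rng.normal(10, 1, n)
x1 = t1 + rng.normal(0, noise_sd, n)
x2 = t1 + beta * (t1 - 10) + rng.normal(0, noise_sd, n)

# With the TRUE error variance (0.25), the correction is essentially exact:
print(blomqvist_corrected_slope(x1, x2 - x1, error_var=0.25))  # ~ -0.40
# Misjudge the stopwatch's shakiness, and the answer drifts:
print(blomqvist_corrected_slope(x1, x2 - x1, error_var=0.10))  # ~ -0.48
```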

The Authors' Solution: The "Reality Check"

Instead of trying to magically fix the data (which often introduces new errors), the authors suggest a smarter approach: The Reality Check.

Don't try to calculate the "perfect" answer. Instead, ask: "Is my result so strong that it couldn't possibly be explained just by my shaky stopwatch?"

  1. Estimate your "Repeatability": How consistent is your measurement? If you measure the same lizard's heat tolerance twice, do you get the same number? If the answer is "not very," your "Repeatability" is low. (Statistically, repeatability is the correlation between repeated measurements of the same individuals.)
  2. Calculate the "Fake" Slope: Based on how shaky your measurement is, calculate what the slope should look like if there were zero real biological effect. (The paper calls this the "structural null").
  3. Compare:
    • If your observed data looks just like the "Fake" slope, then stop. You haven't found a biological truth; you've just found the noise of your measurement.
    • If your data is way different from the "Fake" slope, then you might have found something real. (The code sketch after these steps shows one way to run this check.)
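
Here is a minimal sketch of that reality check (our implementation of the idea, not the paper's code). It assumes the two sessions have similar variance, so the structural null slope is simply −(1 − R), where R is the repeatability; the observed slope is then compared against null slopes simulated from noise alone:

```python
import numpy as np

def reality_check(x1, x2, repeatability, n_sim=2_000, seed=0):
    """Compare the observed crude slope of change on x1 against the
    distribution of slopes produced by measurement noise alone (sketch).

    Assumes equal variance across sessions, so the structural null
    slope is -(1 - R), where R is the repeatability.
    """
    rng = np.random.default_rng(seed)
    n = len(x1)
    observed = np.polyfit(x1, x2 - x1, 1)[0]
    null_slope = -(1 - repeatability)

    # Simulate the null: stable true values plus noise sized to match R.
    total_var = x1.var()
    true_sd = np.sqrt(repeatability * total_var)
    noise_sd = np.sqrt((1 - repeatability) * total_var)
    null_slopes = np.empty(n_sim)
    for i in range(n_sim):
        t = rng.normal(0.0, true_sd, n)
        s1 = t + rng.normal(0.0, noise_sd, n)
        s2 = t + rng.normal(0.0, noise_sd, n)
        null_slopes[i] = np.polyfit(s1, s2 - s1, 1)[0]

    # How extreme is the observed slope relative to pure noise?
    p = np.mean(np.abs(null_slopes - null_slope) >= abs(observed - null_slope))
    print(f"observed slope        : {observed:.3f}")
    print(f"structural null slope : {null_slope:.3f}")
    print(f"Monte Carlo p-value   : {p:.3f}")  # large p => noise explains it
```

If the observed slope sits comfortably inside the noise-only distribution (a large p-value), the "compensatory growth" is exactly what a shaky stopwatch would produce on its own.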

Real-World Examples from the Paper

  • Lizards: Scientists thought lizards with high heat tolerance couldn't get any better (a trade-off). The paper shows that this "trade-off" might just be the result of measurement noise. If the lizards' heat tolerance is hard to measure precisely, the data will look like they can't improve, even if they can.
  • Birds: Scientists thought birds with long telomeres (a marker of aging) lost them faster. The paper shows that when you account for the "shaky" measurement, the evidence for this rule disappears. It might just be statistical noise.

The Big Takeaway

"Correction" is not always better than "No Correction."

If you don't know how precise your measurements are, trying to "correct" for regression to the mean is like trying to fix a leaky roof by painting the ceiling. You might make the ceiling look nice, but the roof is still leaking, and now you've added paint to the mess.

The authors' advice:

  1. Stop blindly applying the "Berry" correction.
  2. Focus on Repeatability. Before you claim a biological discovery, you must know how reliable your measurements are.
  3. If you can't measure your error precisely, admit the uncertainty. Don't claim you found a "trade-off" or a "compensatory growth" if your data could easily be explained by a shaky ruler.

In short: Know your tools before you try to fix the picture.
