Getting over ANOVA: Estimation graphics for multi-group… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: Does a new medicine actually work?

For decades, scientists have solved these mysteries with a rigid tool called Null-Hypothesis Significance Testing (NHST). Think of this tool as a binary light switch. It has only two settings: ON (the result is "significant," so the medicine is declared to work) or OFF (it is not).

The problem? Real life isn't a light switch. It's a dimmer switch with infinite shades of brightness. Just because a light is "ON" doesn't tell you how bright it is. Is it a tiny flicker or a blinding spotlight? The old method often ignores the "how much" and focuses only on the "yes or no," leading to confusion and wasted effort.

This paper introduces a new detective tool called DABEST 2.0. Instead of a light switch, it gives you a high-definition map that shows exactly how big the effect is, how precise the measurement is, and how the data behaves.

Here is how DABEST 2.0 solves four common detective problems, explained with simple analogies:

1. The "Too Many Questions" Problem (Multi-Group Comparisons)

The Old Way: Imagine you have 6 different suspects (groups) and you want to know who is guilty. The old method (ANOVA) first asks, "Is anyone guilty?" If the answer is "Yes," it then forces you to ask 15 separate questions comparing every suspect to every other suspect. It's like checking every single pair of shoes in a closet to see which ones match. It's messy, overwhelming, and you might miss the real culprit because you're too busy counting pairs.

The DABEST Way: Instead of checking every pair, DABEST says, "Let's just compare the suspects directly to the innocent bystander (the control group)."

The Result: You get a clean, simple map showing exactly how much each suspect differs from the innocent one. It cuts the noise and focuses on the story that matters.

2. The "Time Travel" Problem (Repeated Measures)

The Old Way: Imagine tracking a patient's sleep over 7 days. The old method gives you a table of numbers and a bunch of asterisks saying "Significant!" or "Not Significant!" It's like looking at a spreadsheet of temperatures and trying to guess if the weather is getting warmer or colder. You lose the shape of the story.

The DABEST Way: DABEST draws a movie instead of a spreadsheet.

The Top Panel: Shows the actual data points (the raw sleep hours) so you can see the spread.
The Bottom Panel: Shows a "trajectory line" that tells you exactly how much sleep improved each day compared to the start.
The Analogy: It's like watching a runner on a track. You don't just know they finished; you see exactly how their speed changed second-by-second and how confident we are in that speed.

3. The "Double Trouble" Problem (Two-Factor Designs)

The Old Way: Imagine testing a drug on two types of people: those with a genetic mutation and those without. The old method asks, "Is there an interaction?" and gives you a single "Yes/No" answer. It's like asking, "Does the cake taste better with chocolate or vanilla?" and only getting the answer "Yes, there is a difference." It doesn't tell you how much better.

The DABEST Way: DABEST uses a "Delta-Delta" approach. Think of it as a balance scale.

First, it measures how much the mutation changes the baseline (the "placebo" effect).
Then, it measures how much the drug changes the baseline.
Finally, it subtracts the two to find the Net Effect.

The Result: Instead of a vague "Yes," it tells you: "In people with the mutation, the drug adds 5.76 years of survival." It turns a complex math problem into a single, meaningful number you can actually use.

4. The "Yes/No" Problem (Binary Data)

The Old Way: When data is just "Yes" or "No" (like "Did the animal have a seizure?"), scientists often just count the numbers and run a test. It's like saying, "10 people got sick, 5 didn't. The end." It ignores how much the risk changed.

The DABEST Way: DABEST draws a proportion map.

It shows the "Yes" and "No" groups side-by-side with error bars (like a safety net).
It calculates the percentage drop in seizures.
The Analogy: Instead of just saying "The drug worked," it says, "The drug reduced seizures by 68%, and we are very confident the real number is between 53% and 83%."

The "Mini-Meta" Bonus

Finally, the paper addresses the problem of replication. Sometimes scientists run the same experiment three times and get three different results. The old way is to either hide the "bad" results or mash them all together into one big, confusing blob.

DABEST 2.0 introduces a "Mini-Meta" view. It's like a collage.

It shows the result of Experiment 1, Experiment 2, and Experiment 3 side-by-side.
Then, it draws a final "summary arrow" that combines them all, weighted by how reliable each one was.
The Benefit: You see the whole picture, including the contradictions, rather than hiding the messy parts.

The Bottom Line

The authors of this paper are saying: "Stop guessing with light switches. Start measuring with maps."

DABEST 2.0 is a free software tool (available for computers and the web) that helps scientists move away from the confusing "Is it significant?" game and toward the more useful "How big is the effect?" reality. It makes data transparent, easier to understand, and much more honest about what is actually happening in the lab.

1. Problem Statement

The paper addresses critical limitations in current statistical practices within experimental science, particularly in biological research:

Over-reliance on Null-Hypothesis Significance Testing (NHST): NHST promotes a misleading binary decision-making process ("significant" vs. "non-significant"), leading to overconfidence in results, poor reproducibility, and a neglect of effect size quantification.
Inadequacy of ANOVA for Multi-Group Designs: Conventional multi-group analysis relies on Analysis of Variance (ANOVA) followed by post-hoc pairwise comparisons.
- Unfocused Hypotheses: ANOVA tests a global null hypothesis ("all means are equal"), which provides no specific information about which groups differ, the direction of the difference, or the magnitude.
- Multiplicity Issues: Post-hoc testing of all pairs among $g$ groups requires $m = g(g-1)/2$ comparisons (15 for six groups), necessitating corrections (e.g., Bonferroni) that reduce statistical power and increase false negatives.
- Lack of Visualization: Existing estimation tools are largely limited to simple two-group designs, failing to support complex experimental structures common in biology (e.g., repeated measures, factorial designs, binary data).
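
To make the multiplicity point concrete, the short sketch below counts the all-pairs comparisons for a given number of groups and the per-test threshold a Bonferroni correction would impose. The function names and the 0.05 family-wise level are illustrative assumptions, not values taken from the paper.

```python
def pairwise_comparisons(g: int) -> int:
    """Number of distinct pairs among g groups: m = g(g-1)/2."""
    return g * (g - 1) // 2


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test threshold when a family-wise level alpha is split over m tests."""
    return alpha / m


# Illustrative family-wise level (an assumption, not a value from the paper).
FAMILY_ALPHA = 0.05

for groups in (3, 6, 10):
    m = pairwise_comparisons(groups)
    per_test = bonferroni_alpha(FAMILY_ALPHA, m)
    print(f"{groups} groups -> {m} comparisons, per-test alpha = {per_test:.5f}")
```

With six groups, each of the 15 tests must clear a threshold of roughly 0.0033 instead of 0.05, which is where the loss of power comes from.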

2. Methodology

The authors introduce DABEST 2.0 (Data Analysis with Bootstrap-coupled ESTimation), an updated software framework designed to replace ANOVA-based workflows with Estimation Statistics.

Core Philosophy: Shift focus from $p$-values to effect size estimation, precision (confidence intervals), and data graphics.
Statistical Approach:
- Bootstrapping: Uses bias-corrected and accelerated (BCa) bootstrap resampling (5,000 iterations) to calculate effect sizes and confidence intervals without assuming normal distributions. This is robust for small or skewed samples.
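- Visualization: Employs Gardner-Altman plots for two-group comparisons and Cumming plots for multi-group comparisons. Both display the raw data distributions together with the effect size distributions and their confidence intervals; in a Cumming plot the raw data occupy the top panel and the effect sizes the bottom panel.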
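
The resampling idea behind these intervals can be sketched with the standard library alone. The example below is a simplified illustration: it computes a plain percentile bootstrap interval for a mean difference, not the bias-corrected and accelerated (BCa) interval that the summary attributes to DABEST, and it does not call the dabest package. The data values, the function name, and the seed are all hypothetical.

```python
import random
import statistics


def bootstrap_mean_diff(control, test, n_resamples=5000, ci=95, seed=12345):
    """Percentile bootstrap interval for mean(test) - mean(control).

    Simplification: this is the plain percentile interval, not BCa.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    tail = (100 - ci) / 200  # 0.025 for a 95% interval
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[int((1 - tail) * n_resamples) - 1]
    observed = statistics.fmean(test) - statistics.fmean(control)
    return observed, lower, upper


# Hypothetical measurements, invented for illustration only.
control_scores = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9]
test_scores = [6.2, 6.8, 5.9, 7.1, 6.5, 6.0]

observed, lower, upper = bootstrap_mean_diff(control_scores, test_scores)
print(f"mean difference = {observed:.2f} [95% CI: {lower:.2f}, {upper:.2f}]")
```

The output has the same shape as the estimates quoted in the Results section: a point estimate followed by an interval, with no significance verdict attached.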
New Analytical Modules in DABEST 2.0:
1. Repeated-Measures Estimation: Instead of testing all pairwise time points, this focuses on specific comparisons (e.g., Treatment vs. Baseline) to visualize effect trajectories over time.
2. Delta-Delta Analysis (Two-Way Factorial): For designs with two independent variables (e.g., Genotype $\times$ Treatment), it calculates the net effect of an intervention by subtracting the background effect (e.g., placebo) from the treatment effect. This replaces the interaction term of ANOVA with a directly interpretable magnitude.
3. Proportional Data Analysis: For binary outcomes (e.g., seizure occurrence), it visualizes proportions with error bars and calculates effect sizes (e.g., Cohen's $h$), or uses Sankey diagrams for repeated binary measures.
4. Mini-Meta-Analysis: Allows for the synthesis of internally replicated experiments. It calculates a weighted average effect size using pooled variances, promoting transparency over "mega-analyses" that hide inter-experiment variability.
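
Two of the effect sizes named in the list above are simple enough to sketch directly. The example below computes point estimates only; in practice the confidence intervals would come from bootstrap resampling. The subtraction order used for the delta-delta (drug effect in mutants minus drug effect in wild types) is an assumption about the convention, the cell labels and data are hypothetical, and the code does not use the dabest package. Cohen's $h$ follows its standard definition as a difference of arcsine-transformed proportions.

```python
import math
import statistics


def delta_delta(cells):
    """Difference of two mean differences in a 2x2 design.

    `cells` maps (genotype, treatment) -> list of observations.
    Assumed order: (drug - placebo in mutants) - (drug - placebo in wild types).
    """
    mean = {key: statistics.fmean(values) for key, values in cells.items()}
    delta_wild = mean[("wild", "drug")] - mean[("wild", "placebo")]
    delta_mutant = mean[("mutant", "drug")] - mean[("mutant", "placebo")]
    return delta_mutant - delta_wild


def cohens_h(p1, p2):
    """Cohen's h for two proportions, each between 0 and 1."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical 2x2 data, invented for illustration only.
example_cells = {
    ("wild", "placebo"): [10.0, 12.0],
    ("wild", "drug"): [11.0, 13.0],
    ("mutant", "placebo"): [6.0, 8.0],
    ("mutant", "drug"): [12.0, 14.0],
}

print("delta-delta:", delta_delta(example_cells))
print("Cohen's h for 0.80 vs 0.20:", round(cohens_h(0.80, 0.20), 3))
```

Because the delta-delta is a difference of differences, subtracting in the other direction (genotype effect under drug minus genotype effect under placebo) yields the same number.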

3. Key Contributions

Software Expansion: DABEST 2.0 extends the estimation framework from simple two-group comparisons to complex designs:
- Longitudinal/Repeated-measures studies.
- Two-way factorial designs (via Delta-Delta).
- Binary/Proportional data analysis.
- Meta-analysis of internal replicates.
Cross-Platform Accessibility: The tool is available as a Python package (dabest), an R package (dabestr), and a web application (estimationstats.com), making it accessible to programmers and non-programmers alike.
Visual Standardization: It provides a unified visual language (swarm plots, half-violin curves, confidence interval bars) that conveys both descriptive (raw data) and inferential (effect size) information simultaneously.

4. Results (Demonstrated via Simulated Scenarios)

The paper demonstrates DABEST 2.0 through four fictional scenarios contrasting NHST/ANOVA with Estimation Graphics:

Repeated Measures (Sleep Study):
- ANOVA approach: Required 15 pairwise tests with adjusted $p$-values, obscuring the specific trajectory of improvement.
- Estimation approach: Clearly visualized the magnitude of sleep improvement on specific days (e.g., Day 2: +247.8 min [95% CI: 219.5, 274.6]) relative to baseline, showing the plateau and subsequent decline without multiplicity corrections.
Two-Way Factorial (Drug $\times$ Genotype):
- ANOVA approach: Identified a significant interaction ($p = 6.21 \times 10^{-7}$) but left the specific benefit to mutation carriers ambiguous.
- Estimation approach (Delta-Delta): Directly quantified the net drug benefit in mutation carriers as +5.76 years [95% CI: 3.60, 7.89], accounting for placebo effects and providing a clinically interpretable magnitude.
Binary Data (Seizure Model):
- Conventional approach: Often reported as bar charts without error bars or effect sizes.
- Estimation approach: Showed a 68% reduction in seizures [95% CI: 53, 83] and used Sankey plots to visualize categorical changes over time, providing precision estimates previously missing in literature.
Mini-Meta-Analysis (Replicated Trials):
- Demonstrated how to synthesize three pilot trials with conflicting results (two positive, one negative) into a single weighted summary effect (+21.74 [95% CI: 4.51, 39.32]), revealing the overall trend while transparently displaying inter-experiment variance.
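
The weighted summary in the mini-meta scenario can be illustrated with a small sketch. It assumes each experiment's weight is the inverse of its pooled group variance, which is one reading of "weighted average effect size using pooled variances"; the exact weighting in DABEST may differ. The three experiments below are hypothetical and are not the paper's data.

```python
import statistics


def pooled_variance(control, test):
    """Pooled sample variance of two groups."""
    n_c, n_t = len(control), len(test)
    return ((n_c - 1) * statistics.variance(control)
            + (n_t - 1) * statistics.variance(test)) / (n_c + n_t - 2)


def weighted_mean_diff(experiments):
    """Inverse-variance weighted average of per-experiment mean differences.

    Assumption: weight = 1 / pooled variance of that experiment.
    """
    diffs, weights = [], []
    for control, test in experiments:
        diffs.append(statistics.fmean(test) - statistics.fmean(control))
        weights.append(1.0 / pooled_variance(control, test))
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)


# Three hypothetical replicates as (control, test) pairs: two positive, one negative.
replicates = [
    ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),  # difference +3, low variance
    ([0.0, 2.0, 4.0], [3.0, 5.0, 7.0]),  # difference +3, higher variance
    ([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]),  # difference -1, higher variance
]

print("weighted mean difference:", round(weighted_mean_diff(replicates), 3))
```

The noisier replicates, including the negative one, still contribute to the summary but count for less, and all three individual differences remain visible alongside it.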

5. Significance

Reproducibility: By prioritizing effect sizes and confidence intervals over binary significance, the framework encourages transparent reporting and reduces the "file-drawer problem" (hiding negative replicates).
Scientific Interpretation: It shifts the researcher's focus from "Is there an effect?" to "How large is the effect and how precise is the estimate?" This is crucial for clinical and biological decision-making where magnitude matters more than statistical significance.
Practical Utility: DABEST 2.0 fills a critical gap in statistical software, providing the first accessible, comprehensive tool for estimation graphics in complex multi-group experimental designs, thereby facilitating the transition away from traditional ANOVA workflows.

Getting over ANOVA: Estimation graphics for multi-group comparisons