
Understanding unexpected results from randomized clinical trials: Does coffee reduce atrial fibrillation recurrences?

This paper demonstrates that applying supplemental frequentist and Bayesian analyses to a randomized controlled trial on coffee and atrial fibrillation reveals that while the original findings were statistically significant, they likely suffer from type M error and offer only modest probabilities of clinically meaningful benefit, thereby highlighting the importance of robustness checks for unexpected trial results.

Original authors: Brophy, J. M.

Published 2026-04-17



Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you're a detective trying to solve a mystery: Does drinking coffee actually help prevent your heart from skipping a beat (atrial fibrillation), or is it the other way around?

For decades, the medical community believed coffee was like a "heart accelerator"—a dangerous fuel that made heart problems worse. But then, a new study called DECAF came along with a shocking headline: "Drinking coffee actually reduces heart skips!"

The study was a "Randomized Controlled Trial" (RCT), which is usually considered the gold standard of evidence. It took 200 people with heart issues, split them into two groups, and told one group to keep drinking their coffee and the other to quit. The results showed the coffee drinkers had fewer heart skips. The math said this result was "statistically significant" (p < 0.01), meaning a difference this large would be very unlikely to show up if coffee truly made no difference.

But here is the twist: The author of this paper, James Brophy, thinks the original study might be like a magician pulling a rabbit out of a hat that wasn't actually there. He decided to put on his own detective hat and re-examine the evidence using two different tools: Frequentist math (the standard way) and Bayesian math (a way that weighs new evidence against old beliefs).

Here is the story of what he found, explained simply.

1. The "Small Sample" Problem (The Coin Flip)

The original study had 200 people. The authors assumed they would get exactly 100 people in the coffee group and 100 in the no-coffee group.

The Analogy: Imagine flipping a coin 200 times. The authors assumed they would get exactly 100 heads and 100 tails.
The Reality: In the real world, getting exactly 100/100 is rare (only a 5.7% chance!). Counting on a perfect split is like expecting the coin to land heads exactly half the time. The author points out that the study design was a bit too optimistic about how perfectly the groups would balance out.

2. The "Weak Flashlight" Problem (Power and Type M Error)

This is the biggest issue. The original study was designed to find a huge benefit (like a 41% reduction in heart skips).

The Analogy: Imagine you are trying to spot a tiny firefly in a dark forest using a very weak flashlight. You only have enough battery to look for a giant bonfire.
The Reality: If you use a weak flashlight (a small study) to look for a small, realistic benefit (like a 15% reduction), you will almost certainly miss it.
The Trap: However, if you do see a "light" (a statistically significant result) with such a weak flashlight, it is likely a Type M (Magnitude) Error. This means that what looks like a giant, glowing bonfire is probably just a small candle. The study found a "big" benefit, but because the study was too small to detect small, realistic benefits, the result is likely exaggerated. It's like seeing a shadow and thinking it's a giant monster, when it's actually just a small dog.

3. The "Old Beliefs" Problem (The Bayesian Approach)

This is where the author uses Bayesian analysis.

The Analogy: Imagine you are a judge in a courtroom.
- The Standard Approach (Frequentist): The judge looks only at the evidence presented in the courtroom today (the DECAF study) and ignores everything else. If today's evidence looks strong, the verdict follows from it alone.
- The Bayesian Approach: The judge looks at the evidence today but also remembers everything heard in earlier cases (the long-standing medical belief that coffee is bad for the heart).
The Reality: The author argues that we can't just ignore 50 years of medical history that says "coffee is bad for arrhythmias." When you combine the new "surprising" data with the old "suspicious" beliefs, the new data doesn't look quite so convincing.
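- The original study reported p < 0.01, which headlines tend to read as "There is a 99% chance coffee helps!" (A p-value does not actually say that; it only says the result would be surprising if coffee did nothing.)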
- The Bayesian re-analysis said: "Well, given our old beliefs, there's only an 88% chance that coffee helps enough to be clinically useful." It tempers the excitement. It says, "Maybe it helps a little, but maybe not as much as the headline suggests."

4. The "Gut Feeling" vs. "Data"

The author notes something funny: Even though the math said coffee was good, the original doctors who wrote the paper were hesitant to say "Drink coffee!" They used cautious language like "associated with" rather than "causes."

The Analogy: It's like a weather forecaster who sees a perfect sunny forecast on their computer but still tells you to bring an umbrella because "it just feels like rain."
The Lesson: The author argues that while "gut feelings" are important, we shouldn't let them override the data too much. But conversely, we shouldn't let a single, small, underpowered study override decades of medical wisdom either.

The Big Takeaway

This paper isn't saying "Coffee is definitely bad" or "Coffee is definitely good." It is saying: "We need to be smarter about how we read surprising news."

When a study comes out with a result that goes against everything we know (like coffee helping the heart), we should:

Check the Flashlight: Was the study big enough to find small, realistic effects? (In this case, no).
Check the Magnifying Glass: Did the study exaggerate the size of the effect because it was too small? (Likely yes).
Check the History: Does this new result fit with what we already know? (It doesn't fit well).

The Conclusion:
The DECAF study is a great example of how a "statistically significant" result can still be misleading. By using better math (Bayesian methods) and being honest about the study's limitations, the author shows that the benefit of coffee is likely modest, not the miracle cure the headlines suggested.

It's a reminder that in science, surprising results need extra scrutiny, not just celebration. Just because a number is "significant" doesn't mean the story is true.

1. Problem Statement

The paper addresses the challenge of interpreting "unexpected" or "surprising" results from Randomized Controlled Trials (RCTs) that contradict established medical beliefs. Specifically, it critiques the DECAF trial (published in JAMA), which reported that caffeinated coffee consumption (approx. 1 cup/day) significantly reduced atrial fibrillation (AF) recurrence compared to abstinence following cardioversion.

This finding is counter-intuitive because caffeine has historically been considered proarrhythmic. The author argues that standard frequentist interpretations of such surprising results often lack robustness, potentially leading to:

Type M (Magnitude) Errors: Overestimation of effect sizes due to low statistical power.
Confusion between Statistical and Clinical Significance: A statistically significant $p$-value does not guarantee a clinically meaningful benefit.
Ignoring Prior Knowledge: Failure to incorporate historical context (e.g., caffeine's known risks) into the analysis.

2. Methodology

The author performed a secondary analysis of the DECAF trial data using a combination of adjunctive frequentist and Bayesian approaches.

A. Data Reconstruction

Individual Patient Data (IPD) Extraction: Since raw IPD was not available, cumulative incidence curves from the original publication were extracted using WebPlotDigitizer.
Transformation: The extracted data were transformed into survival format using the Guyot algorithm (via the IPDfromKM R package) to reconstruct a Kaplan-Meier plot and generate pseudo-IPD for secondary analysis.

B. Frequentist Re-evaluation (Power and Type M Error)

Power Analysis: The author recalculated the statistical power of the DECAF trial (N=200) under more realistic effect sizes. The original design assumed a 41% relative risk reduction (RRR) to achieve 80% power.
Type M Error Assessment: Using the retrodesign package, the author assessed the likelihood that a statistically significant result from an underpowered study would exaggerate the true effect size.
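As a rough check on the power recalculation, the idea can be sketched with a normal approximation for comparing two proportions. This is an illustration under stated assumptions, not necessarily the method used in the paper: the 50% baseline recurrence and the 100 patients per arm are taken from the text, while the function name `power_two_proportions` and the unpooled standard error are choices made here.

```python
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate two-sided power for a two-proportion z-test (unpooled SE)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = (p_control * (1 - p_control) / n_per_arm
          + p_treated * (1 - p_treated) / n_per_arm) ** 0.5
    z_effect = abs(p_control - p_treated) / se
    # Probability of landing beyond either critical value
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


baseline = 0.50      # assumed recurrence risk in the comparison arm
n_per_arm = 100      # assumes a perfectly balanced 200-patient trial

power_design = power_two_proportions(baseline, baseline * (1 - 0.41), n_per_arm)
power_realistic = power_two_proportions(baseline, baseline * (1 - 0.15), n_per_arm)

print(f"Power for a 41% relative reduction: {power_design:.2f}")
print(f"Power for a 15% relative reduction: {power_realistic:.2f}")
```

Under these assumptions the design effect clears 80% power, while the 15% reduction falls well below 30%. The exact value depends on the method (proportion-based versus survival-based) and on dropout assumptions, so it need not match the paper's figure to the percentage point.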

C. Bayesian Survival Analysis

Modeling: A Bayesian Cox proportional hazards model and a binomial risk difference model were fitted using the brms package (interface to Stan).
Priors:
- Baseline Risk: A weakly informative prior centered at 50% recurrence (logit scale mean 0, SD 1.5).
- Treatment Effect: A prior centered on a 41% reduction (-0.871 on the logit scale, which corresponds to a drop from 50% to 29.5% recurrence), with its direction set by the historical belief that caffeine is harmful. The prior distribution therefore favored the abstinence (decaf) group, assigning a 93.5% probability of benefit to abstinence and only 6.5% to caffeine.
Computation: Four chains were run with 2,000 warm-up and 6,000 sampling iterations using Hamiltonian Monte Carlo (HMC) with the No-U-Turn Sampler (NUTS). Convergence was verified via Gelman-Rubin statistics.
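The prior numbers above can be related to each other with a few lines of arithmetic. The sketch below checks that a 41% reduction from a 50% baseline gives the quoted logit value, and then asks what spread a prior would need to put 93.5% of its mass on one side of zero. The normal shape of the prior is an assumption made here for illustration; the text does not state the prior's family or standard deviation.

```python
from math import log
from statistics import NormalDist


def logit(p):
    """Log-odds of a probability."""
    return log(p / (1 - p))


baseline_risk = 0.50
relative_reduction = 0.41
reduced_risk = baseline_risk * (1 - relative_reduction)   # 0.295

# Shift on the logit scale implied by the design assumption
log_odds_shift = logit(reduced_risk) - logit(baseline_risk)

# ASSUMPTION: a normal prior centered at that magnitude. The SD below is the
# value that would leave 93.5% of the mass on the favored side of zero.
z_935 = NormalDist().inv_cdf(0.935)
implied_sd = abs(log_odds_shift) / z_935

print(f"Logit shift for a 41% reduction: {log_odds_shift:.3f}")
print(f"Implied prior SD (normal assumption): {implied_sd:.2f}")
```

The implied SD is only a back-calculation under the normal assumption, useful for seeing how informative such a prior is, not a reported parameter of the model.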

3. Key Contributions

Methodological Framework: Demonstrates how to combine frequentist power analysis (specifically Type M error) with Bayesian re-analysis to contextualize surprising RCT findings.
Critique of Trial Design: Highlights specific flaws in the DECAF trial, including:
- Underpowered Design: The study had only ~24% power to detect a realistic 15% relative risk reduction, making any significant finding likely an exaggeration.
- Directional Ambiguity: The trial did not pre-specify which arm (caffeinated vs. abstinent) was hypothesized to be beneficial, increasing the risk of "researcher degrees of freedom" bias.
- Group Balance: The assumption of a perfect 100/100 split in a 200-patient trial is statistically improbable (only 5.7% chance).
Distinction of Significance: Provides a clear framework for distinguishing between statistical significance (low $p$-value) and clinical significance (probability of a meaningful effect size).
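The group-balance point is easy to check directly. A minimal sketch, assuming simple randomization in which each of the 200 patients is assigned independently with probability 0.5 (real trials often use blocked randomization, which would change this):

```python
from math import comb

n_patients = 200

# Probability that exactly half of n independent fair assignments land in one arm
p_exact_split = comb(n_patients, n_patients // 2) / 2 ** n_patients

print(f"P(exactly 100/100 split): {p_exact_split:.3f}")
```

Under this assumption the exact binomial probability comes out at a little under 6%, in line with the roughly 5.7% quoted above.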

4. Key Results

Frequentist Findings

Power: To detect a realistic 15% relative risk reduction (from 50% to 42.5%), the study had only 24% power.
Type M Error: Given the low power, any statistically significant result is likely to overstate the true effect by a factor of approximately two.
Reconstructed Data: The reconstructed Kaplan-Meier analysis yielded a Hazard Ratio (HR) of 0.62 (95% CI 0.43–0.91), closely matching the original published HR of 0.61.
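The exaggeration factor can be illustrated with a closed-form version of the Type M calculation under a normal sampling distribution. This is a sketch, not the retrodesign package's code: the function name `retrodesign_normal` is hypothetical, and the inputs (a true risk difference of 7.5 percentage points with 100 patients per arm) are assumptions based on the 50% to 42.5% scenario.

```python
from statistics import NormalDist


def retrodesign_normal(true_effect, se, alpha=0.05):
    """Power, Type S rate, and Type M exaggeration ratio.

    Assumes the estimate is normally distributed around a positive true effect.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    mu = true_effect / se                        # true effect in SE units
    p_right = 1 - std_normal.cdf(z_crit - mu)    # significant, correct sign
    p_wrong = std_normal.cdf(-z_crit - mu)       # significant, wrong sign
    power = p_right + p_wrong
    # E[|estimate| ; significant] in SE units, from truncated-normal means
    expected_abs = (mu * (p_right - p_wrong)
                    + std_normal.pdf(z_crit - mu)
                    + std_normal.pdf(z_crit + mu))
    exaggeration = expected_abs / (power * mu)
    return power, p_wrong / power, exaggeration


# ASSUMED inputs: 50% vs 42.5% recurrence, 100 patients per arm
se = (0.5 * 0.5 / 100 + 0.425 * 0.575 / 100) ** 0.5
power, type_s, exaggeration = retrodesign_normal(true_effect=0.075, se=se)

print(f"Power: {power:.2f}")
print(f"Type S rate: {type_s:.3f}")
print(f"Exaggeration ratio: {exaggeration:.1f}")
```

With these inputs, a significant estimate overstates the true effect by a bit more than a factor of two on average, which agrees with the summary's "approximately two". A lower assumed power pushes the ratio higher.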

Bayesian Findings

Posterior Hazard Ratio: After incorporating the prior belief that caffeine is likely harmful, the posterior HR shifted from 0.61 to 0.74 (95% CrI 0.53–1.04).
- Interpretation: The evidence for a benefit is weaker than the frequentist result suggests. The 95% Credible Interval now includes 1.0 (no effect).
Probabilities of Benefit:
- Probability that Caffeine is better than Abstinence (HR < 1): ~96%.
- Probability of a clinically meaningful benefit (HR < 0.9): ~88%.
- Probability of a clinically meaningful Risk Difference (RD < -2%, or NNT ≤ 50): ~82%.
Risk Difference: The posterior mean risk difference was -7.6% (95% CrI: -19.5% to +4.4%), compared to the frequentist estimate of -17%. The Bayesian approach pulled the estimate toward the null, reflecting the influence of prior skepticism.
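The probabilities of benefit can be roughly recovered from the reported point estimates and credible intervals alone. The sketch below treats each posterior as normal (on the log scale for the hazard ratio), which is an approximation introduced here; the paper's figures presumably come from the posterior draws themselves. The helper name `normal_from_interval` is hypothetical.

```python
from math import log
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def normal_from_interval(center, lower, upper):
    """Normal approximation whose 95% interval spans (lower, upper)."""
    return NormalDist(mu=center, sigma=(upper - lower) / (2 * Z95))


# Posterior hazard ratio 0.74 (95% CrI 0.53 to 1.04), handled on the log scale
log_hr = normal_from_interval(log(0.74), log(0.53), log(1.04))
p_any_benefit = log_hr.cdf(log(1.0))     # P(HR < 1)
p_meaningful_hr = log_hr.cdf(log(0.9))   # P(HR < 0.9)

# Posterior risk difference -7.6% (95% CrI -19.5% to +4.4%), in percentage points
risk_diff = normal_from_interval(-7.6, -19.5, 4.4)
p_meaningful_rd = risk_diff.cdf(-2.0)    # P(RD < -2%)

# A 2-point absolute reduction corresponds to a number needed to treat of 50
nnt_threshold = 100 / 2.0

print(f"P(HR < 1)   ~ {p_any_benefit:.2f}")
print(f"P(HR < 0.9) ~ {p_meaningful_hr:.2f}")
print(f"P(RD < -2%) ~ {p_meaningful_rd:.2f}  (NNT <= {nnt_threshold:.0f})")
```

The approximation lands within a point or so of the reported 96%, 88%, and 82%, which suggests the summary statistics are internally consistent.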

5. Significance and Conclusion

Contextualization of Evidence: The paper argues that "unexpected" results should not be accepted at face value. The DECAF trial's strong $p$-value ($p < 0.01$) was likely a product of low power and Type M error, rather than a true large effect.
Role of Priors: Contrary to the view that priors introduce bias, the author demonstrates that transparent, well-justified priors (based on historical data and trial design assumptions) act as a necessary "reality check" against over-interpretation of surprising data.
Clinical Decision Making: While the frequentist analysis suggested a strong case for coffee consumption, the Bayesian analysis suggests the evidence is "modest at best." The probability of a clinically meaningful benefit is not near certainty.
Broader Implication: The author notes that this is not an isolated case, citing a review of other high-impact RCTs with surprising results. The paper calls for:
1. Rigorous sample size calculations based on realistic effect sizes.
2. Pre-specification of directional hypotheses.
3. Adoption of Bayesian methods to quantify the probability of clinical significance and integrate prior knowledge, thereby preventing premature changes in clinical practice based on underpowered, surprising trials.

Final Takeaway: The DECAF trial's finding that coffee reduces AF recurrence is likely an exaggeration of a smaller, potentially non-existent effect. Supplemental Bayesian analysis is essential to temper such findings and prevent the medical community from adopting interventions based on statistical artifacts rather than robust clinical evidence.