Are all models wrong? Falsifying binary formation… — Plain-Language Explanation

Original authors: Lachlan Passenger, Eric Thrane, Paul D. Lasky, Ethan Payne, Simon Stevenson, Ben Farr

Published 2026-05-11

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Are We Missing Something?

Imagine you are a detective trying to figure out how a specific type of crime happens. You have a theory (a "model") about how these crimes are committed. Usually, you check your theory by looking at a bunch of cases and seeing if your theory fits the average ones.

But sometimes, a case comes along that is wildly different from the rest. It's so strange that it makes you wonder: "Is my theory actually wrong? Or is this just a lucky fluke?"

In the world of gravitational waves (ripples in space-time caused by colliding black holes), scientists have found a few "exceptional" events. One famous example is GW190521, a collision involving black holes so massive that, according to standard theories of how stars die, they shouldn't form from collapsing stars at all. They fall into a "forbidden zone" (called the pair-instability mass gap) where stars are expected to blow themselves apart rather than leave a black hole behind.

Scientists have built many new theories to explain how these giant black holes could form. But here is the problem: Just because a theory can explain the weird event doesn't mean it's a good explanation.

The Problem with Current Methods

Usually, scientists use a tool called "Bayesian model selection" to compare theories. Think of this like a race. If you have three runners (three theories) and one wins, you declare the winner the "best."

But what if all three runners are terrible? What if they all run so slowly that they can't actually finish the race? A race only tells you who is least bad; it doesn't tell you if anyone is actually good enough to do the job.

This paper asks a different question: "Does this specific theory actually have the ability to explain this weird event, even if we don't compare it to other theories?"

The New Tool: The "Unusualness" Test

The authors created a new statistical method to answer this. Here is how it works, using a cookie factory analogy:

The Factory (The Model): Imagine a cookie factory that makes cookies of different sizes. The factory has a rule: "We only make cookies between 2 and 4 inches wide."
The Batches (Simulations): The scientists run the factory's computer program 100 times. Each time, they generate a "batch" of 100 cookies (simulated black hole collisions).
The Biggest Cookie (The Extremal Event): In every batch, they find the single biggest cookie.
The Pattern: After running 100 batches, they look at the sizes of those "biggest cookies." They build a map showing what the "biggest cookie" usually looks like in this factory.
The Real Mystery: Now, they look at the real giant cookie found in nature (GW190521).
The Test: They ask: "If we ran this factory 100 times, how often would we get a 'biggest cookie' that is this weird?"

They calculate a score called a p-value.

High Score (Good): If the factory often produces a "biggest cookie" this size, the theory is plausible. The factory can make this cookie.
Low Score (Bad): If the factory almost never makes a cookie this size, the theory is likely wrong. The factory is broken, or the rules are wrong.
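
To make the recipe concrete, here is a minimal Python sketch of the cookie test. It is a toy, not the authors' code: the two factories, their size ranges, and the function names are invented for illustration, and it compares cookie sizes directly instead of using the paper's full score.

```python
import random

def biggest_cookie_p_value(draw_cookie, observed_size,
                           n_batches=1000, batch_size=100, seed=0):
    """Fraction of simulated batches whose biggest cookie is at least
    as big as the one observed in nature."""
    rng = random.Random(seed)
    # Step 1-3: bake many batches and keep only the biggest cookie of each.
    biggest = [max(draw_cookie(rng) for _ in range(batch_size))
               for _ in range(n_batches)]
    # Step 4-6: how often does the factory match or beat the real cookie?
    n_as_extreme = sum(1 for size in biggest if size >= observed_size)
    return n_as_extreme / n_batches

# Hypothetical factory with the strict rule from the analogy (2 to 4 inches).
def strict_factory(rng):
    return rng.uniform(2.0, 4.0)

# Hypothetical factory with a looser rule (2 to 8 inches).
def loose_factory(rng):
    return rng.uniform(2.0, 8.0)

observed = 6.0  # assumed size of the "giant cookie found in nature"
p_strict = biggest_cookie_p_value(strict_factory, observed)
p_loose = biggest_cookie_p_value(loose_factory, observed)
print("strict factory p-value:", p_strict)
print("loose factory p-value:", p_loose)
```

The strict factory can never bake a 6-inch cookie, so its score is zero, while the loose factory produces one in essentially every batch.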

What They Tested

The scientists applied this test to four different "factories" (theories) that try to explain GW190521:

AGN Model (Small Seeds): Black holes growing in the gas disks that surround the supermassive black holes at the centers of active galaxies (active galactic nuclei, or AGN), but starting with small "seeds" (max 15 solar masses).
- Result: Fail. This factory almost never makes cookies this big. The theory is effectively ruled out.
AGN Model (Medium Seeds): Same as above, but starting with medium seeds (max 50 solar masses).
- Result: Suspicious. It's very rare for this factory to make a cookie this big. It's not impossible, but it's unlikely (about a 1 in 100 chance).
AGN Model (Large Seeds): Same as above, but starting with large seeds (max 75 solar masses).
- Result: Pass. This factory makes cookies this size quite often. The theory is a plausible explanation.
Globular Cluster Model: Black holes forming in dense star clusters.
- Result: Pass. This factory also makes cookies this size reasonably often. The theory is plausible.

The "Signal-to-Noise" Twist

The paper also highlights a clever detail. Imagine you see a cookie, but it's blurry.

If the cookie is blurry (low signal), you aren't sure if it's actually huge or just looks huge because of the blur.
If the cookie is crystal clear (high signal) and it's huge, you know for sure it's huge.

The authors' method takes this "blur" into account. If a theory claims to explain a crystal-clear, massive event, but the math says that event is impossible for that theory, the theory gets a very low score. If the event is blurry, the score is a bit more forgiving. This makes the test more accurate than previous methods.
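
The effect of the blur can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: a bell-curve (Gaussian) measurement error, a factory rule of 2 to 4 inches, a wide "anything goes" range of 0 to 10 inches, and the function name `blurred_score`. It is meant to show the idea, not to reproduce the paper's calculation.

```python
from statistics import NormalDist

def blurred_score(measured, blur, model_range=(2.0, 4.0), wide_range=(0.0, 10.0)):
    """Compare how well the factory rule explains a blurry measurement
    against an 'anything goes' rule. Higher means more compatible."""
    # Gaussian measurement error: true sizes near `measured` are most likely.
    likelihood = NormalDist(mu=measured, sigma=blur)

    def average_likelihood(lo, hi):
        # Likelihood averaged over a flat range of true sizes [lo, hi].
        return (likelihood.cdf(hi) - likelihood.cdf(lo)) / (hi - lo)

    return average_likelihood(*model_range) / average_likelihood(*wide_range)

# Same measured size (4.5 inches, just outside the factory rule),
# seen once sharply and once through heavy blur.
sharp = blurred_score(4.5, blur=0.1)
blurry = blurred_score(4.5, blur=1.0)
print("sharp measurement score:", sharp)
print("blurry measurement score:", blurry)
```

With a sharp measurement the cookie is clearly outside the rule and the score collapses; with a blurry one the true size could plausibly be under 4 inches, so the score is far more forgiving.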

The Conclusion

The paper concludes that not all models are created equal.

Some models (like the one with small starting seeds) are simply wrong for explaining the massive black hole GW190521.
Other models (those with larger starting seeds or specific cluster dynamics) can explain it.

The main takeaway is that we need to stop just ranking models against each other. Instead, we need to test if our models are even capable of explaining the most extreme events in the universe. If a model can't explain the "weird" stuff, it's not a good model, no matter how well it explains the "normal" stuff.

Technical Summary: Falsifying Binary Formation Models in Gravitational-Wave Astronomy Using Exceptional Events

Problem Statement
As the catalogue of gravitational-wave (GW) transients expands, specific events appear "exceptional" relative to the broader population. Notable examples include GW190521, which likely contained black holes within the pair-instability mass gap ($\sim 50-135 M_\odot$), and GW190814, characterized by an extreme mass ratio and a secondary component mass of $\sim 2.6 M_\odot$. While a "model-building industry" has emerged to explain these events, standard Bayesian model selection is limited. It provides a relative ranking of models but cannot answer the fundamental question: Do any of our current models provide an adequate explanation for these exceptional events? If existing models are inadequate, simply ranking them is insufficient; new models are required.

Methodology
The authors introduce a frequentist framework to test whether a specific population model can plausibly explain the most exceptional events observed, without directly comparing it to alternative models. This approach extends the posterior predictive check methodology of Fishbach et al. (2020b) to account for measurement uncertainty.

The core of the method involves the following steps:

Simulation of Extremal Events: For a given population model $M$, the authors simulate $N$ events (e.g., $N=100$) to create a catalogue. They identify the "apparently most extreme" event in each catalogue (e.g., the event with the highest total mass).
Handling Measurement Uncertainty: Unlike previous methods that rely on maximum likelihood estimates, this method incorporates the full posterior distribution of the event parameters. The authors define a "normalised evidence" metric, $Z$, as a ratio of two likelihood-weighted integrals: the numerator integrates the likelihood against the model's prior on the extremal event's parameters $\theta_{\text{ext}}$ (conditioned on detection and catalogue size), and the denominator integrates the same likelihood against a uniform prior:
$Z \equiv \frac{\int d\theta_{\text{ext}} \, \mathcal{L}(d|\theta_{\text{ext}}) \, \pi(\theta_{\text{ext}}|M, \text{det}, N)}{\int d\theta_{\text{ext}} \, \mathcal{L}(d|\theta_{\text{ext}}) \, \pi(\theta_{\text{ext}}|U)}$
Here, $\mathcal{L}(d|\theta_{\text{ext}})$ is the likelihood of the data $d$, and $\pi(\theta_{\text{ext}}|U)$ is a uniform prior.
P-value Calculation: By generating an empirical distribution of $Z$ from many simulated catalogues, the authors calculate a $p$-value for an observed exceptional event. This $p$-value represents the fraction of simulated extremal events that are less consistent with the model (i.e., have a lower $Z$) than the observed event.
- A small $p$-value indicates the observed event is unusual under the model, suggesting the model is inadequate.
- A large $p$-value ($O(1)$) indicates the event is consistent with the model's predictions for extremal events.
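
The three steps above can be sketched end to end in a one-dimensional toy. This is a sketch under stated assumptions, not the authors' implementation: the population model is a hypothetical Gaussian in total mass, the measurement error is Gaussian with a fixed width `SIGMA`, selection effects (the "det" condition) are ignored, and the uniform prior range is arbitrary. $Z$ is estimated by averaging the prior ratio $\pi(\theta_{\text{ext}}|M,N)/\pi(\theta_{\text{ext}}|U)$ over posterior samples drawn under the uniform prior, which equals the ratio of integrals in the definition of $Z$.

```python
import random
from statistics import NormalDist, fmean

N_EVENTS = 100              # catalogue size N (the example value in the text)
SIGMA = 5.0                 # hypothetical measurement uncertainty on total mass
LOW, HIGH = 0.0, 300.0      # hypothetical range of the uniform prior U
UNIFORM_PDF = 1.0 / (HIGH - LOW)

def extremal_pdf(model, x, n=N_EVENTS):
    """Density of the largest of n independent draws from `model`."""
    return n * model.cdf(x) ** (n - 1) * model.pdf(x)

def posterior_samples(measured, rng, n_samples=500):
    """Gaussian likelihood with a flat prior gives a Gaussian posterior
    (truncation at the prior edges is neglected in this toy)."""
    return [rng.gauss(measured, SIGMA) for _ in range(n_samples)]

def normalised_evidence(samples, model):
    """Monte Carlo estimate of Z: mean prior ratio over posterior samples."""
    return fmean(extremal_pdf(model, x) / UNIFORM_PDF for x in samples)

def simulated_z(model, rng):
    """Z for the apparently most extreme event of one simulated catalogue."""
    measured = max(rng.gauss(model.mean, model.stdev) + rng.gauss(0.0, SIGMA)
                   for _ in range(N_EVENTS))
    return normalised_evidence(posterior_samples(measured, rng), model)

def p_value(model, measured_obs, n_catalogues=300, seed=1):
    """Fraction of simulated extremal events with lower Z than the observed one."""
    rng = random.Random(seed)
    z_obs = normalised_evidence(posterior_samples(measured_obs, rng), model)
    z_sim = [simulated_z(model, rng) for _ in range(n_catalogues)]
    return sum(z < z_obs for z in z_sim) / n_catalogues

toy_model = NormalDist(mu=40.0, sigma=10.0)   # hypothetical population model
p_ordinary = p_value(toy_model, measured_obs=66.0)    # typical catalogue maximum
p_extreme = p_value(toy_model, measured_obs=150.0)    # far beyond the model's reach
print("p-value for a typical extremal event:", p_ordinary)
print("p-value for an exceptional event:", p_extreme)
```

In this toy, an event near the typical catalogue maximum yields a large $p$-value, while an event far outside the model's reach yields a $p$-value of zero because no simulated catalogue produces a lower $Z$.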

Key Contributions

A New Statistical Metric: The introduction of the "normalised evidence" $Z$ allows for the assessment of model consistency while explicitly accounting for parameter estimation uncertainty (signal-to-noise ratio effects), which maximum-likelihood-based methods miss.
Frequentist Model Criticism: The paper advocates for a multi-pronged approach to model testing, distinguishing between relative model comparison (Bayes factors) and absolute model adequacy (falsification via $p$-values).
Computational Efficiency: By focusing solely on the most exceptional events rather than the entire catalogue, the method significantly reduces the computational cost compared to "maximum population likelihood" approaches.

Results
The authors applied this framework to test four variations of binary formation models against the event GW190521:

AGN Models (Gayathri et al. 2023): Three variations based on the maximum allowed natal black hole mass ($m_{\text{max}}$).
- $m_{\text{max}} = 15 M_\odot$: $p \simeq 0$. The model almost never produces events as massive as GW190521 and is effectively ruled out.
- $m_{\text{max}} = 50 M_\odot$: $p = 0.01$. The model is disfavored at the two-sigma level; GW190521 is considered very unusual under this model.
- $m_{\text{max}} = 75 M_\odot$: $p = 0.61$. The model frequently produces GW190521-like events and provides an adequate explanation.
Globular Cluster Model (Rodriguez et al. 2019): Assuming zero natal black hole spins.
- $p = 0.12$. The model reasonably explains the event, suggesting it is plausible to draw a GW190521-like event from this population.
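
For readers who prefer to think in sigmas, a $p$-value can be translated into a Gaussian-equivalent significance. The short sketch below does this with Python's standard library. The helper name `p_to_sigma` is hypothetical, and the choice between a one-sided and a two-sided convention is an assumption, since this summary does not say which one the paper uses.

```python
from statistics import NormalDist

def p_to_sigma(p, two_sided=False):
    """Gaussian-equivalent significance of a p-value (0 < p < 1)."""
    # For a two-sided convention the probability is split between both tails.
    tail = p / 2 if two_sided else p
    return NormalDist().inv_cdf(1.0 - tail)

# The two nonzero p-values below 0.5 quoted in the results above.
for label, p in [("m_max = 50", 0.01), ("globular cluster", 0.12)]:
    one = p_to_sigma(p)
    two = p_to_sigma(p, two_sided=True)
    print(f"{label}: p = {p} -> {one:.2f} sigma (one-sided), {two:.2f} sigma (two-sided)")
```

Under either convention $p = 0.01$ lands between two and three sigma (about 2.3 one-sided, about 2.6 two-sided), while $p = 0.12$ stays below two sigma.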

The study demonstrates that hierarchical merger scenarios in both Active Galactic Nuclei (AGN) and globular clusters can bridge the pair-instability mass gap, provided specific conditions (high natal masses or zero natal spins) are met.

Significance and Claims
The paper claims to provide a rigorous method for "model criticism" in gravitational-wave astronomy. By shifting the focus from relative model ranking to absolute model adequacy, the authors argue that this method can identify when none of the tested models are sufficient, thereby motivating the development of new formation channels.

The authors emphasize that their approach complements existing tools:

Unlike Bayes factors, which only compare models relative to one another, this method tests if a model fits the data at all.
Unlike leave-one-out outlier tests, which check self-consistency across data subsets, this method specifically targets the ability of a model to explain the most extreme outliers.
Unlike maximum population likelihood methods, this approach is computationally cheaper as it isolates exceptional events.

The paper concludes that this framework is a "posterior predictive check" that avoids the shortcomings of purely Bayesian or frequentist approaches by utilizing a $p$-value derived from a distribution of Bayes factors (normalised evidences). The authors suggest this method could be extended to test models against other exceptional properties, such as extreme spins, extreme mass ratios (e.g., GW190814), or small secondary masses.

Are all models wrong? Falsifying binary formation models in gravitational-wave astronomy