Are all models wrong? Falsifying binary formation models in gravitational-wave astronomy

This paper introduces a frequentist p-value method to test the adequacy of gravitational-wave formation models, demonstrating that while some proposed explanations for exceptional events like GW190521 are sufficient, others fail to adequately account for the observed data.

Original authors: Lachlan Passenger, Eric Thrane, Paul D. Lasky, Ethan Payne, Simon Stevenson, Ben Farr

Published 2026-05-11

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: Are We Missing Something?

Imagine you are a detective trying to figure out how a specific type of crime happens. You have a theory (a "model") about how these crimes are committed. Usually, you check your theory by looking at a bunch of cases and seeing if your theory fits the average ones.

But sometimes, a case comes along that is wildly different from the rest. It's so strange that it makes you wonder: "Is my theory actually wrong? Or is this just a lucky fluke?"

In the world of gravitational waves (ripples in space-time caused by colliding black holes), scientists have found a few "exceptional" events. One famous example is GW190521, a collision involving two black holes so massive that, according to standard models of how stars die, they shouldn't exist. They fall into a "forbidden zone" (called the pair-instability mass gap), where a star is expected to blow itself apart completely rather than leave behind a black hole that big.

Scientists have built many new theories to explain how these giant black holes could form. But here is the problem: Just because a theory can explain the weird event doesn't mean it's a good explanation.

The Problem with Current Methods

Usually, scientists use a tool called "Bayesian model selection" to compare theories. Think of this like a race. If you have three runners (three theories) and one wins, you declare the winner the "best."

But what if all three runners are terrible? What if they all run so slowly that they can't actually finish the race? A race only tells you who is least bad; it doesn't tell you if anyone is actually good enough to do the job.
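
The race problem can be made concrete with a toy calculation. The sketch below is not from the paper: the three models, their Gaussian predictions, and the observed mass are all invented for illustration. It shows that one model can beat another by a huge likelihood ratio (a Bayes factor, for models with no free parameters) even when every model assigns the observation a vanishingly small probability.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical models: each predicts the heaviest black hole mass
# (in solar masses) as a Gaussian (mean, std). Numbers are invented.
models = {"A": (30.0, 5.0), "B": (35.0, 5.0), "C": (40.0, 5.0)}
observed_mass = 85.0  # hypothetical observation, far above every prediction

likelihoods = {
    name: gaussian_pdf(observed_mass, mean, std)
    for name, (mean, std) in models.items()
}

# The "race": rank the models and compare the winner with the loser.
best = max(likelihoods, key=likelihoods.get)
bayes_factor_c_vs_a = likelihoods["C"] / likelihoods["A"]

print("winner:", best)
print("Bayes factor C vs A:", bayes_factor_c_vs_a)
print("likelihood of winner:", likelihoods[best])
```

Model C wins the race decisively, yet its own likelihood for the observation is still tiny. The ranking alone never reveals that.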

This paper asks a different question: "Does this specific theory actually have the ability to explain this weird event, even if we don't compare it to other theories?"

The New Tool: The "Unusualness" Test

The authors created a new statistical method to answer this. Here is how it works, using a cookie factory analogy:

  1. The Factory (The Model): Imagine a cookie factory that makes cookies of different sizes. The factory has a rule: "We only make cookies between 2 and 4 inches wide."
  2. The Batches (Simulations): The scientists run the factory's computer program 100 times. Each time, they generate a "batch" of 100 cookies (simulated black hole collisions).
  3. The Biggest Cookie (The Extremal Event): In every batch, they find the single biggest cookie.
  4. The Pattern: After running 100 batches, they look at the sizes of those "biggest cookies." They build a map showing what the "biggest cookie" usually looks like in this factory.
  5. The Real Mystery: Now, they look at the real giant cookie found in nature (GW190521).
  6. The Test: They ask: "If we ran this factory 100 times, how often would we get a 'biggest cookie' that is this weird?"

They calculate a score called a p-value.

  • High Score (Good): If the factory often produces a "biggest cookie" this size, the theory is plausible. The factory can make this cookie.
  • Low Score (Bad): If the factory almost never makes a "biggest cookie" this size, the theory is a poor explanation for it. Either the factory's rules are wrong, or something is missing from them.
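
The cookie-factory steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform "factory", the function names `simulate_batch_max` and `extremal_p_value`, and the choice of a one-sided test (counting only batches whose biggest cookie is at least as large as the observed one) are all hypothetical.

```python
import random

def simulate_batch_max(rng, n_events, low, high):
    """Draw one batch of n_events sizes from a toy uniform model; return the largest."""
    return max(rng.uniform(low, high) for _ in range(n_events))

def extremal_p_value(observed_max, simulated_maxima):
    """Fraction of simulated batches whose maximum is at least as large as observed."""
    n_at_least = sum(1 for m in simulated_maxima if m >= observed_max)
    return n_at_least / len(simulated_maxima)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Steps 1-4: run the toy factory (cookies between 2 and 4 inches) 100 times,
# 100 cookies per batch, and record the biggest cookie from each batch.
maxima = [simulate_batch_max(rng, 100, 2.0, 4.0) for _ in range(100)]

# Steps 5-6: compare an observed "biggest cookie" with that distribution.
p_plausible = extremal_p_value(3.9, maxima)    # a size the factory can reach
p_implausible = extremal_p_value(5.0, maxima)  # a size outside the factory's rules

print("p-value for a 3.9 inch cookie:", p_plausible)
print("p-value for a 5.0 inch cookie:", p_implausible)
```

With only 100 simulated batches, a p-value of zero really means "smaller than about 1 in 100". Resolving rarer outcomes requires more batches.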

What They Tested

The scientists applied this test to four different "factories" (theories) that try to explain GW190521:

  1. AGN Model (Small Seeds): Black holes growing in the gas disks of active galactic nuclei (AGN), the bright centers of some galaxies, but starting with small "seeds" (max 15 solar masses).
    • Result: Fail. This factory almost never makes cookies this big. The theory is effectively ruled out.
  2. AGN Model (Medium Seeds): Same as above, but starting with medium seeds (max 50 solar masses).
    • Result: Suspicious. It's very rare for this factory to make a cookie this big. It's not impossible, but it's unlikely (about a 1 in 100 chance).
  3. AGN Model (Large Seeds): Same as above, but starting with large seeds (max 75 solar masses).
    • Result: Pass. This factory makes cookies this size quite often. The theory is a plausible explanation.
  4. Globular Cluster Model: Black holes forming in dense star clusters.
    • Result: Pass. This factory also makes cookies this size reasonably often. The theory is plausible.

The "Signal-to-Noise" Twist

The paper also highlights a clever detail. Imagine you see a cookie, but it's blurry.

  • If the cookie is blurry (low signal), you aren't sure if it's actually huge or just looks huge because of the blur.
  • If the cookie is crystal clear (high signal) and it's huge, you know for sure it's huge.

The authors' method takes this "blur" into account. If a theory claims to explain a crystal-clear, massive event, but the math says that event is impossible for that theory, the theory gets a very low score. If the event is blurry, the score is a bit more forgiving. This makes the test more accurate than previous methods.
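
One simple way to picture the effect of blur is to average the "how often is the biggest cookie at least this big?" fraction over many plausible true sizes of the observed event. The sketch below does exactly that. It is an assumption made for illustration, and the paper's actual treatment of measurement uncertainty may differ. The Gaussian toy model, the numbers, and the name `blurred_p_value` are hypothetical.

```python
import random

def blurred_p_value(plausible_masses, simulated_maxima):
    """Average, over plausible true masses of the observed event, of the
    fraction of simulated batch maxima that are at least that large."""
    total = 0.0
    for mass in plausible_masses:
        n_at_least = sum(1 for m in simulated_maxima if m >= mass)
        total += n_at_least / len(simulated_maxima)
    return total / len(plausible_masses)

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Toy model: the heaviest event per batch clusters around 60 solar masses.
simulated_maxima = [rng.gauss(60.0, 5.0) for _ in range(500)]

# The same central estimate (85 solar masses), measured sharply or blurrily.
sharp = [rng.gauss(85.0, 2.0) for _ in range(500)]    # high signal-to-noise
blurry = [rng.gauss(85.0, 20.0) for _ in range(500)]  # low signal-to-noise

p_sharp = blurred_p_value(sharp, simulated_maxima)
p_blurry = blurred_p_value(blurry, simulated_maxima)

print("p-value, sharp measurement: ", p_sharp)
print("p-value, blurry measurement:", p_blurry)
```

In this toy setup the sharp measurement yields a far smaller p-value than the blurry one, matching the intuition that a clearly measured outlier counts more strongly against a model.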

The Conclusion

The paper concludes that not all models are created equal.

  • Some models (like the one with small starting seeds) are simply wrong for explaining the massive black holes in GW190521.
  • Other models (those with larger starting seeds or specific cluster dynamics) can explain it.

The main takeaway is that we need to stop just ranking models against each other. Instead, we need to test if our models are even capable of explaining the most extreme events in the universe. If a model can't explain the "weird" stuff, it's not a good model, no matter how well it explains the "normal" stuff.
