Statistical significance in choice modelling: computation, usage and reporting

This paper critiques the over-reliance on and misinterpretation of statistical significance in choice modelling, advocating for more precise reporting of uncertainty measures and a greater emphasis on behavioural and policy significance alongside statistical findings.

Stephane Hess, Andrew Daly, Michiel Bliemer, Angelo Guevara, Ricardo Daziano, Thijs Dekker

Published 2026-03-10

Imagine you are a detective trying to solve a mystery: Why do people choose the bus over the car, or the train over the bike?

To solve this, you build a "crystal ball" (a statistical model) that looks at data from thousands of trips. This crystal ball gives you numbers (estimates) that tell you how much people dislike waiting for a bus or how much they hate paying for a ticket.

But here's the problem: Your crystal ball isn't perfect. It's based on a sample of people, not every person on earth. So, your numbers have a little bit of "fuzziness" or uncertainty.

This paper is a guide for detectives (choice modellers) on how to talk about that fuzziness without lying to themselves or the public. It argues that the field has become too obsessed with a specific "magic number" (95% confidence) and has forgotten to ask the real question: "Does this actually matter?"

Here is the breakdown of the paper using simple analogies:

1. The "Fuzziness" of the Crystal Ball (Uncertainty)

When you estimate a number, you aren't getting the "True Truth." You are getting a "Best Guess."

  • The Analogy: Imagine trying to guess the average height of everyone in a city by measuring just 50 people. If you picked a different 50 people, you'd get a slightly different average.
  • The Paper's Point: We need to measure how much our guess might wiggle if we picked different people. We do this using Standard Errors (how much the guess wiggles) and Confidence Intervals (a range where the true answer probably lives).
  • The Trap: Sometimes, the "fuzziness" is bigger than we think because we didn't account for the fact that the same person made multiple trips (repeated choices). It's like measuring the same person's height 10 times and pretending you measured 10 different people. That makes your guess look too precise when it's actually sloppy.
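The ideas above can be sketched in a few lines of code. All numbers here are made up for illustration (a hypothetical city with average height 170 cm); the point is how a standard error and a 95% confidence interval are computed from a sample of just 50 people.

```python
import random
import statistics

random.seed(42)

# Hypothetical city: true average height 170 cm, spread 10 cm (invented numbers).
population_mean, population_sd = 170.0, 10.0

# Measure just 50 people: one possible sample out of many.
sample = [random.gauss(population_mean, population_sd) for _ in range(50)]

estimate = statistics.mean(sample)                          # the "best guess"
std_error = statistics.stdev(sample) / len(sample) ** 0.5   # how much the guess wiggles
ci = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"estimate = {estimate:.1f} cm, standard error = {std_error:.2f}")
print(f"95% confidence interval = ({ci[0]:.1f}, {ci[1]:.1f})")
```

Rerun with a different seed (a different 50 people) and the estimate moves around; the standard error tells you by roughly how much.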

2. The "Magic 95%" Rule (Statistical Significance)

For a long time, scientists have followed a convention: if, assuming there were truly no effect at all, results as extreme as yours would show up less than 5% of the time by pure chance, you call the result "Statistically Significant." This is the famous p < 0.05 rule.

  • The Analogy: Imagine a security guard at a club. The rule is: "If there's a 95% chance this person is a VIP, let them in."
  • The Problem: The paper argues that the guard is too rigid.
    • Big Data Bias: If you have a huge crowd (a massive dataset), even a tiny, meaningless difference can pass the 95% test. It's like the guard letting in a VIP who is only 1 inch taller than the average person. It's "significant" but useless.
    • Small Data Bias: If you have a small crowd, a really important difference might get rejected because the "fuzziness" is high. It's like the guard kicking out a real VIP because they were wearing a hat that made them look shorter.
  • The Advice: Don't just look at the 95% line. Ask: "Is this effect big enough to change a policy?" If a new train line saves people 2 minutes a day, it might not be "statistically significant" in a small study, but it's still a great idea for the city.
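The "big data bias" above is easy to demonstrate. The sketch below uses illustrative numbers (not from the paper): it computes the t-ratio for the same tiny mean difference at two sample sizes, and with a huge sample a behaviourally meaningless difference of 0.05 sails past the 1.96 threshold.

```python
import math

def t_ratio(diff, sd, n):
    """t-ratio for a difference in means between two groups of size n,
    each with standard deviation sd."""
    std_error = sd * math.sqrt(2.0 / n)
    return diff / std_error

tiny_diff = 0.05   # a tiny, meaningless difference (hypothetical units)
sd = 1.0

for n in (100, 1_000_000):
    t = t_ratio(tiny_diff, sd, n)
    verdict = "significant" if abs(t) > 1.96 else "not significant"
    print(f"n = {n:>9,}: t = {t:6.2f} -> {verdict}")
```

Note that the effect never got any bigger; only the sample did.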

3. The "Three Musketeers" of Testing (Hypothesis Tests)

When you want to prove your crystal ball is right, you use three different tools (tests) to check your work. The paper calls them the Likelihood Ratio, Wald, and Lagrange Multiplier tests.

  • The Analogy: Imagine you are testing a new recipe.
    • Wald Test: You cook the soup with the salt already in it, taste it once, and judge from that single taste whether the salt made a difference. (Fast, since you only cook one pot, but it relies on approximations.)
    • Likelihood Ratio: You cook the soup without salt, taste it, then cook it again with salt, taste it again, and compare the two. (Slower, since you cook twice, but the most reliable comparison.)
    • Lagrange Multiplier: You cook only the unsalted soup and judge, from its taste, whether adding salt would help. (Useful when the full recipe is too hard to cook.)
  • The Advice: The paper says the "Wald test" (the t-ratio most people use) is often too crude. If you can afford the extra computation, use the "Likelihood Ratio" (comparing the full model to a restricted one), because it relies on fewer approximations.
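Of the three, the likelihood ratio test is the simplest to compute once both models are estimated: the statistic is twice the gap in log-likelihood between the full and restricted model, compared against a chi-square critical value. The log-likelihoods below are invented numbers purely for illustration.

```python
# Hypothetical final log-likelihoods from two estimated models
# (illustrative numbers, not taken from the paper):
ll_restricted = -1502.3  # model with one coefficient fixed to zero
ll_full = -1498.1        # model with that coefficient estimated freely

critical_95 = 3.84       # chi-square critical value, 1 restriction, 95% level

lr_stat = 2.0 * (ll_full - ll_restricted)  # likelihood ratio statistic

print(f"LR = {lr_stat:.2f} vs critical value {critical_95}")
if lr_stat > critical_95:
    print("Reject the restriction: the extra variable earns its place.")
else:
    print("The restricted (simpler) model cannot be rejected.")
```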

4. The "Star" System (Reporting Results)

In many scientific papers, you see numbers with stars next to them: *, **, ***.

  • The Analogy: It's like a movie rating. *** means "Great," * means "Okay."
  • The Problem: The paper says this is dangerous. If you only see the stars, you don't know how big the effect actually is, or whether the stars came from a one-sided or a two-sided test.
  • The Advice: Stop hiding behind stars. Show the actual numbers (the estimate and the standard error). Let the reader decide if the result is good enough. If you hide the numbers, you can't calculate the "confidence interval" (the range of truth).
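Reporting the estimate and its standard error, rather than stars, lets any reader reconstruct the rest. A minimal sketch with made-up numbers (a hypothetical cost coefficient, not taken from the paper):

```python
# A reported coefficient and its standard error (illustrative numbers):
estimate = -0.042
std_error = 0.015

# From these two numbers the reader can recover everything the stars hide:
t_ratio = estimate / std_error
ci_95 = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"t-ratio = {t_ratio:.2f}")
print(f"95% CI = ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

A star can only say "passed the test"; the two numbers say by how much, and in which direction.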

5. The "Significance" vs. "Importance" Trap

This is the most important lesson.

  • Significance: "Is this result real, or just a fluke?" (Did the coin land on heads 10 times in a row by chance?)
  • Importance: "Does this result matter?" (If the coin lands on heads, does it change the outcome of the game?)
  • The Analogy: Imagine you are testing a new medicine.
    • Significant but Useless: The medicine cures a headache 0.001 seconds faster than a placebo. It is "statistically significant" (because you tested 1 million people), but it's useless to a patient.
    • Not Significant but Vital: The medicine cures a headache in 10 minutes, but your sample size was small, so the math says "we aren't 95% sure." But if you ignore it, people suffer.
  • The Advice: In choice modelling (like transport planning), we need to care about Policy Importance. If a variable (like cost) makes sense logically, keep it in the model even if the math says it's "weak." Don't throw away a variable just because it didn't pass the 95% test.

Summary: What Should You Do?

The authors are telling choice modellers to:

  1. Stop obsessing over the 95% line. It's an arbitrary rule that breaks with big or small data.
  2. Be honest about the "fuzziness." Report the actual numbers (standard errors), not just stars.
  3. Ask "So What?" A result can be statistically real but practically useless. Focus on whether the finding changes how we understand human behavior or helps make better policies.
  4. Use better tools. If you have complex data (like people making many trips), use better math (like bootstrapping) to get the right "fuzziness" measurement.
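Point 4 can be illustrated with a small simulation. The setup below is entirely invented: 30 hypothetical people each make 5 correlated trips. The naive standard error treats all 150 trips as independent; a cluster bootstrap resamples whole people instead, preserving the within-person correlation, and comes out noticeably larger, which is the honest answer.

```python
import random
import statistics

random.seed(0)

# Hypothetical panel data: 30 people, 5 trips each (invented numbers).
# Trips by the same person are correlated through a shared personal taste.
people = []
for _ in range(30):
    taste = random.gauss(0.0, 1.0)                        # person-level effect
    trips = [taste + random.gauss(0.0, 0.5) for _ in range(5)]
    people.append(trips)

all_trips = [t for trips in people for t in trips]

# Naive SE: pretends all 150 trips came from 150 different people.
naive_se = statistics.stdev(all_trips) / len(all_trips) ** 0.5

# Cluster bootstrap: resample whole people, so each replicate keeps
# the repeated-choice structure intact.
boot_means = []
for _ in range(2000):
    resampled = [random.choice(people) for _ in range(len(people))]
    flat = [t for trips in resampled for t in trips]
    boot_means.append(statistics.mean(flat))
boot_se = statistics.stdev(boot_means)

print(f"naive SE = {naive_se:.3f}, cluster-bootstrap SE = {boot_se:.3f}")
```

The naive number looks reassuringly precise; the bootstrap number is the one that reflects how much the estimate would really wiggle across different groups of people.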

In a nutshell: Don't let the math trick you into thinking a tiny, meaningless difference is a breakthrough, and don't let the math trick you into throwing away a potentially huge idea just because the sample size was small. Use your brain, not just your calculator.