Statistical significance in choice modelling: computation, usage and reporting

This paper critiques the over-reliance on and misinterpretation of statistical significance in choice modelling, advocating for more precise reporting of uncertainty measures and a greater emphasis on behavioural and policy significance alongside statistical findings.

Stephane Hess, Andrew Daly, Michiel Bliemer, Angelo Guevara, Ricardo Daziano, Thijs Dekker

Published 2026-03-10

Imagine you are a detective trying to solve a mystery: Why do people choose the bus over the car, or the train over the bike?

To solve this, you build a "crystal ball" (a statistical model) that looks at data from thousands of trips. This crystal ball gives you numbers (estimates) that tell you how much people dislike waiting for a bus or how much they hate paying for a ticket.

But here's the problem: Your crystal ball isn't perfect. It's based on a sample of people, not every person on earth. So, your numbers have a little bit of "fuzziness" or uncertainty.

This paper is a guide for detectives (choice modellers) on how to talk about that fuzziness without lying to themselves or the public. It argues that the field has become too obsessed with a specific "magic number" (95% confidence) and has forgotten to ask the real question: "Does this actually matter?"

Here is the breakdown of the paper using simple analogies:

1. The "Fuzziness" of the Crystal Ball (Uncertainty)

When you estimate a number, you aren't getting the "True Truth." You are getting a "Best Guess."

  • The Analogy: Imagine trying to guess the average height of everyone in a city by measuring just 50 people. If you picked a different 50 people, you'd get a slightly different average.
  • The Paper's Point: We need to measure how much our guess might wiggle if we picked different people. We do this using Standard Errors (how much the guess wiggles) and Confidence Intervals (a range where the true answer probably lives).
  • The Trap: Sometimes, the "fuzziness" is bigger than we think because we didn't account for the fact that the same person made multiple trips (repeated choices). It's like measuring the same person's height 10 times and pretending you measured 10 different people. That makes your guess look too precise when it's actually sloppy.
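The ideas above can be sketched in a few lines of code. All numbers here are made up for illustration (a hypothetical city with average height 170 cm); the point is how a standard error and a 95% confidence interval are computed from a sample of just 50 people.

```python
import random
import statistics

random.seed(42)

# Hypothetical city: true average height 170 cm, spread 10 cm (invented numbers).
population_mean, population_sd = 170.0, 10.0

# Measure just 50 people: one possible sample out of many.
sample = [random.gauss(population_mean, population_sd) for _ in range(50)]

estimate = statistics.mean(sample)                          # the "best guess"
std_error = statistics.stdev(sample) / len(sample) ** 0.5   # how much the guess wiggles
ci = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"estimate = {estimate:.1f} cm, standard error = {std_error:.2f}")
print(f"95% confidence interval = ({ci[0]:.1f}, {ci[1]:.1f})")
```

Rerun with a different seed (a different 50 people) and the estimate moves around; the standard error tells you by roughly how much.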

2. The "Magic 95%" Rule (Statistical Significance)

For a long time, scientists have followed a convention: if, assuming there were truly no effect at all, results as extreme as yours would show up less than 5% of the time by pure chance, you call the result "Statistically Significant." This is the famous p < 0.05 rule.

  • The Analogy: Imagine a security guard at a club. The rule is: "If there's a 95% chance this person is a VIP, let them in."
  • The Problem: The paper argues that the guard is too rigid.
    • Big Data Bias: If you have a huge crowd (a massive dataset), even a tiny, meaningless difference can pass the 95% test. It's like the guard letting in a VIP who is only 1 inch taller than the average person. It's "significant" but useless.
    • Small Data Bias: If you have a small crowd, a really important difference might get rejected because the "fuzziness" is high. It's like the guard kicking out a real VIP because they were wearing a hat that made them look shorter.
  • The Advice: Don't just look at the 95% line. Ask: "Is this effect big enough to change a policy?" If a new train line saves people 2 minutes a day, it might not be "statistically significant" in a small study, but it's still a great idea for the city.
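The "big data bias" above is easy to demonstrate. The sketch below uses illustrative numbers (not from the paper): it computes the t-ratio for the same tiny mean difference at two sample sizes, and with a huge sample a behaviourally meaningless difference of 0.05 sails past the 1.96 threshold.

```python
import math

def t_ratio(diff, sd, n):
    """t-ratio for a difference in means between two groups of size n,
    each with standard deviation sd."""
    std_error = sd * math.sqrt(2.0 / n)
    return diff / std_error

tiny_diff = 0.05   # a tiny, meaningless difference (hypothetical units)
sd = 1.0

for n in (100, 1_000_000):
    t = t_ratio(tiny_diff, sd, n)
    verdict = "significant" if abs(t) > 1.96 else "not significant"
    print(f"n = {n:>9,}: t = {t:6.2f} -> {verdict}")
```

Note that the effect never got any bigger; only the sample did.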

3. The "Three Musketeers" of Testing (Hypothesis Tests)

When you want to prove your crystal ball is right, you use three different tools (tests) to check your work. The paper calls them the Likelihood Ratio, Wald, and Lagrange Multiplier tests.

  • The Analogy: Imagine you are testing a new recipe.
    • Wald Test: You cook the soup with the salt already in it, taste it once, and judge from that single taste whether the salt made a difference. (Fast, since you only cook one pot, but it relies on approximations.)
    • Likelihood Ratio: You cook the soup without salt, taste it, then cook it again with salt, taste it again, and compare the two. (Slower, since you cook twice, but the most reliable comparison.)
    • Lagrange Multiplier: You cook only the unsalted soup and judge, from its taste, whether adding salt would help. (Useful when the full recipe is too hard to cook.)
  • The Advice: The paper says the "Wald test" (the t-ratio most people use) is often too crude. If you can afford the extra computation, use the "Likelihood Ratio" (comparing the full model to a restricted one), because it relies on fewer approximations.
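Of the three, the likelihood ratio test is the simplest to compute once both models are estimated: the statistic is twice the gap in log-likelihood between the full and restricted model, compared against a chi-square critical value. The log-likelihoods below are invented numbers purely for illustration.

```python
# Hypothetical final log-likelihoods from two estimated models
# (illustrative numbers, not taken from the paper):
ll_restricted = -1502.3  # model with one coefficient fixed to zero
ll_full = -1498.1        # model with that coefficient estimated freely

critical_95 = 3.84       # chi-square critical value, 1 restriction, 95% level

lr_stat = 2.0 * (ll_full - ll_restricted)  # likelihood ratio statistic

print(f"LR = {lr_stat:.2f} vs critical value {critical_95}")
if lr_stat > critical_95:
    print("Reject the restriction: the extra variable earns its place.")
else:
    print("The restricted (simpler) model cannot be rejected.")
```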

4. The "Star" System (Reporting Results)

In many scientific papers, you see numbers with stars next to them: *, **, ***.

  • The Analogy: It's like a movie rating. *** means "Great," * means "Okay."
  • The Problem: The paper says this is dangerous. If you only see the stars, you don't know how big the effect actually is, or whether the stars came from a one-sided or a two-sided test.
  • The Advice: Stop hiding behind stars. Show the actual numbers (the estimate and the standard error). Let the reader decide if the result is good enough. If you hide the numbers, you can't calculate the "confidence interval" (the range of truth).
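Reporting the estimate and its standard error, rather than stars, lets any reader reconstruct the rest. A minimal sketch with made-up numbers (a hypothetical cost coefficient, not taken from the paper):

```python
# A reported coefficient and its standard error (illustrative numbers):
estimate = -0.042
std_error = 0.015

# From these two numbers the reader can recover everything the stars hide:
t_ratio = estimate / std_error
ci_95 = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"t-ratio = {t_ratio:.2f}")
print(f"95% CI = ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

A star can only say "passed the test"; the two numbers say by how much, and in which direction.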

5. The "Significance" vs. "Importance" Trap

This is the most important lesson.

  • Significance: "Is this result real, or just a fluke?" (Did the coin land on heads 10 times in a row by chance?)
  • Importance: "Does this result matter?" (If the coin lands on heads, does it change the outcome of the game?)
  • The Analogy: Imagine you are testing a new medicine.
    • Significant but Useless: The medicine cures a headache 0.001 seconds faster than a placebo. It is "statistically significant" (because you tested 1 million people), but it's useless to a patient.
    • Not Significant but Vital: The medicine cures a headache in 10 minutes, but your sample size was small, so the math says "we aren't 95% sure." But if you ignore it, people suffer.
  • The Advice: In choice modelling (like transport planning), we need to care about Policy Importance. If a variable (like cost) makes sense logically, keep it in the model even if the math says it's "weak." Don't throw away a variable just because it didn't pass the 95% test.

Summary: What Should You Do?

The authors are telling choice modellers to:

  1. Stop obsessing over the 95% line. It's an arbitrary rule that breaks with big or small data.
  2. Be honest about the "fuzziness." Report the actual numbers (standard errors), not just stars.
  3. Ask "So What?" A result can be statistically real but practically useless. Focus on whether the finding changes how we understand human behavior or helps make better policies.
  4. Use better tools. If you have complex data (like people making many trips), use better math (like bootstrapping) to get the right "fuzziness" measurement.
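Point 4 can be illustrated with a small simulation. The setup below is entirely invented: 30 hypothetical people each make 5 correlated trips. The naive standard error treats all 150 trips as independent; a cluster bootstrap resamples whole people instead, preserving the within-person correlation, and comes out noticeably larger, which is the honest answer.

```python
import random
import statistics

random.seed(0)

# Hypothetical panel data: 30 people, 5 trips each (invented numbers).
# Trips by the same person are correlated through a shared personal taste.
people = []
for _ in range(30):
    taste = random.gauss(0.0, 1.0)                        # person-level effect
    trips = [taste + random.gauss(0.0, 0.5) for _ in range(5)]
    people.append(trips)

all_trips = [t for trips in people for t in trips]

# Naive SE: pretends all 150 trips came from 150 different people.
naive_se = statistics.stdev(all_trips) / len(all_trips) ** 0.5

# Cluster bootstrap: resample whole people, so each replicate keeps
# the repeated-choice structure intact.
boot_means = []
for _ in range(2000):
    resampled = [random.choice(people) for _ in range(len(people))]
    flat = [t for trips in resampled for t in trips]
    boot_means.append(statistics.mean(flat))
boot_se = statistics.stdev(boot_means)

print(f"naive SE = {naive_se:.3f}, cluster-bootstrap SE = {boot_se:.3f}")
```

The naive number looks reassuringly precise; the bootstrap number is the one that reflects how much the estimate would really wiggle across different groups of people.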

In a nutshell: Don't let the math trick you into thinking a tiny, meaningless difference is a breakthrough, and don't let the math trick you into throwing away a potentially huge idea just because the sample size was small. Use your brain, not just your calculator.