From GEV to ResLogit: Spatially Correlated Discrete Choice Models for Pedestrian Movement Prediction

This paper demonstrates that for predicting high-frequency pedestrian movements near autonomous vehicles, a residual neural network logit (ResLogit) model outperforms traditional spatial generalized extreme value (GEV) specifications by more effectively capturing proximity-induced correlations while preserving model interpretability.

Rulla Al-Haideri, Bilal Farooq

Published 2026-03-03

Imagine you are standing on a busy street corner, waiting to cross the road. A self-driving car (an AV) is approaching. In the split second before you move, your brain makes a tiny, almost automatic calculation: Do I speed up? Do I slow down? Do I step left or right?

This paper is about teaching computers to understand that split-second decision-making process, specifically how pedestrians move when interacting with self-driving cars.

Here is the breakdown of the research using simple analogies.

The Problem: The "Grid" of Choices

The researchers didn't try to predict exactly where a pedestrian's foot will land (like predicting a specific coordinate on a map). Instead, they imagined a 3x3 tic-tac-toe board floating in front of the pedestrian.

  • The Center: Keep walking at the same speed and direction.
  • The Top Row: Slow down.
  • The Bottom Row: Speed up.
  • The Left/Right Columns: Step left or right.

Every time a pedestrian moves, they are essentially picking one of these 9 squares. The challenge is that these squares are neighbors. If you pick "Step Left," it's very similar to picking "Step Left and Slow Down." They are so close that they are almost the same choice.
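The 9-square board is just a discrete choice set. A minimal sketch of that encoding (the label names below are illustrative, not the paper's):

```python
from itertools import product

# Hypothetical encoding of the 3x3 action grid described above:
# each square combines a speed change with a lateral step.
speed_change = ["slow_down", "keep_speed", "speed_up"]
lateral_step = ["left", "straight", "right"]

# The 9 discrete alternatives a pedestrian picks between at every step.
choice_set = list(product(speed_change, lateral_step))
print(len(choice_set))  # 9

# Note how ("keep_speed", "left") and ("slow_down", "left") are adjacent
# squares: that adjacency is exactly the "neighbouring choices" problem.
```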

The Old Way: The "Rigid Nesting" Models (GEV)

For a long time, statisticians tried to solve this using "Spatial GEV" models. Think of these models as architects trying to build a house with pre-fabricated rooms.

  • They decided in advance: "Okay, the 'Left' square and the 'Left-Slow' square must be in the same 'room' (nest) because they are neighbors."
  • They built a rigid structure to force the computer to understand that these choices are related.
  • The Result: It worked a little bit better than doing nothing, but it was like trying to fit a round peg in a square hole. The "rooms" were too rigid. The real world is messy, and the pre-built rooms didn't quite capture how people actually make those tiny, fluid adjustments. The improvement was barely noticeable.
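The "pre-fabricated rooms" idea corresponds to a nested logit, one member of the GEV family. Here is a minimal sketch of how a nest makes grouped choices correlate; the nesting, utility numbers, and scale parameter are illustrative, not the paper's actual specification:

```python
import math

# Two hand-picked "rooms": lateral moves share a nest, keeping course
# stands alone. Utilities are made-up numbers for illustration.
utilities = {"left": 1.0, "left_slow": 0.8, "keep": 1.2}
nests = {"lateral": ["left", "left_slow"], "center": ["keep"]}
lam = 0.5  # nest scale in (0, 1]; smaller = stronger within-nest correlation

def nested_logit_probs(utilities, nests, lam):
    # Inclusive value (logsum) summarises each nest's overall attractiveness.
    iv = {n: lam * math.log(sum(math.exp(utilities[a] / lam) for a in alts))
          for n, alts in nests.items()}
    denom = sum(math.exp(v) for v in iv.values())
    probs = {}
    for n, alts in nests.items():
        p_nest = math.exp(iv[n]) / denom  # P(pick this "room")
        z = sum(math.exp(utilities[a] / lam) for a in alts)
        for a in alts:
            # P(alternative) = P(room) * P(alternative | room)
            probs[a] = p_nest * math.exp(utilities[a] / lam) / z
    return probs

probs = nested_logit_probs(utilities, nests, lam)
print(round(sum(probs.values()), 6))  # 1.0
```

The catch the paper points to: the analyst must decide the rooms in advance, and a 3x3 grid of overlapping neighbourhoods does not carve cleanly into a few fixed nests.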

The New Way: The "Smart Tutor" (ResLogit)

The researchers then tried a new approach called ResLogit. Think of this not as an architect, but as a smart tutor.

  1. The Base Lesson: First, the tutor teaches the computer the basic rules of walking (e.g., "If a car is coming fast, slow down"; "If you are far from your destination, keep going"). This is the "Linear Utility" part—it's the human-readable logic.
  2. The Correction: Then, the tutor looks at thousands of real-world examples and says, "Wait, the basic rules aren't perfect. When the car is this close and the pedestrian is this tired, they actually tend to do this specific weird move."
  3. The Learning: The computer learns these tiny "corrections" automatically. It doesn't force the choices into rigid rooms; it learns that "Step Left" and "Step Left-Slow" are neighbors because it saw people make those mistakes or choices together in the data.
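The three steps above can be sketched as a utility function: an interpretable linear part plus a learned skip-connected correction, fed into a softmax over the 9 squares. The feature counts, weights, and the single tanh residual layer below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-alternative features (e.g. distance to the AV,
# deviation from the desired direction) for the 9 grid squares.
n_alts, n_feat = 9, 4
X = rng.normal(size=(n_alts, n_feat))

# 1) The "base lesson": human-readable linear utility with taste weights beta.
beta = rng.normal(size=n_feat)
linear_utility = X @ beta

# 2) The "correction": a small residual term nudges each utility.
#    Skip connection: corrected = linear + f(linear), so the linear
#    part (and its interpretation) survives intact.
M = rng.normal(size=(n_alts, n_alts)) * 0.1
corrected_utility = linear_utility + np.tanh(linear_utility) @ M

# 3) Choice probabilities over the 9 squares.
def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

probs = softmax(corrected_utility)
```

In training, beta and M would be fitted jointly on observed trajectories; the point is that zeroing out M recovers a plain, fully interpretable logit model.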

The Results: Why the "Smart Tutor" Won

When they tested these two approaches:

  • The Rigid Architect (GEV): Made very few mistakes, but mostly just guessed the most popular moves. It didn't really understand the subtle differences between the 9 squares.
  • The Smart Tutor (ResLogit): Got much better at predicting the right move.
    • The "Safe Mistake" Analogy: If the computer guesses wrong, the "Rigid Architect" might guess you will run across the street when you actually just stepped back. That's a dangerous, big error.
    • The "Smart Tutor," however, usually misses by a single square when it guesses wrong. For example, if you actually stepped "Left," it guessed "Left-Slow." That is a tiny, harmless error. In the world of self-driving cars, knowing you will step roughly left is much safer than not knowing at all.
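This "safe mistake" idea can be made concrete with a distance-aware score on the grid. The tiny toy dataset below is invented for illustration; cells are indexed (row, col) with row encoding speed change and col encoding lateral step:

```python
# Hypothetical predicted vs. actual squares on the 3x3 grid.
actual    = [(1, 0), (1, 1), (2, 1), (1, 2)]
predicted = [(0, 0), (1, 1), (0, 1), (1, 2)]

def cell_distance(a, b):
    """Chebyshev distance: adjacent squares are 1 apart, opposite moves 2."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

# Plain accuracy treats every miss the same...
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# ...while mean grid distance separates a harmless neighbour miss
# (distance 1) from a dangerous opposite-move miss (distance 2).
mean_dist = sum(cell_distance(a, p) for a, p in zip(actual, predicted)) / len(actual)
print(accuracy, mean_dist)  # 0.5 0.75
```

Two models with identical accuracy can have very different mean distances, which is why the "neighbour errors" of ResLogit matter for safety.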

The Big Takeaway

The paper shows that for crowded, fast-moving situations where people make tiny, split-second adjustments, letting the computer learn the correlation patterns from data (ResLogit) works better than forcing the choices into strict, pre-defined groupings (GEV).

Crucially, the "Smart Tutor" still explains why it made its decision. It didn't become a "black box." It still says, "I predicted you would slow down because the car is close," but it adds a little extra "nudge" based on what it learned from real people.

In short: To teach a robot how humans walk near cars, don't just give it a rigid map. Give it a set of basic rules, and then let it learn from the messy, real-world experience of how people actually move.
