A Spatio-temporal Graph Network Allowing Incomplete Trajectory Input for Pedestrian Trajectory Prediction

Imagine you are a robot trying to walk through a busy coffee shop. Your job is to predict where the customers will be in the next few seconds so you don't bump into them.

Most robot brains (algorithms) today have a very strict rule: "If I can't see you clearly for the entire time you've been in my view, I will ignore you completely."

This is a problem. In a real coffee shop, people get blocked by pillars, other people, or the robot's own body. If a customer steps behind a counter for a second, a standard robot stops tracking them. It thinks, "Oh, they vanished! I'll just pretend they don't exist." This is dangerous because that person might step right in front of the robot a moment later.

This paper introduces a new robot brain called STGN-IT that solves this problem. Here is how it works, using some simple analogies:

1. The "Ghost" Problem (Incomplete Trajectories)

Imagine you are playing a game of tag, but every time a player hides behind a tree, the referee erases them from the scoreboard. The other players stop looking for them.

Old Way: If a pedestrian is hidden (occluded), the robot deletes them from its memory.
STGN-IT's Way: The robot says, "I can't see you right now, but I know you were there a second ago. I will mark your spot as 'Ghost Mode' and keep guessing where you might go." It doesn't delete the person; it just marks them as "temporarily invisible."

2. The "Smart Map" (Occupancy Grid)

Most robots use a fancy, hand-drawn map of the world. But in a real, messy environment, you need a map that updates itself instantly.

The Analogy: Think of a foggy window. If you wipe a spot on the glass, you see the world clearly. STGN-IT uses a "Point Cloud" (like a 3D laser scan) to automatically create a "Foggy Window Map" (Occupancy Grid). It doesn't need a human to tell it where the walls are; it just sees the obstacles (walls, chairs, pillars) and adds them to its mental map instantly.

3. The "Two-Step Dance" (The Prediction Process)

STGN-IT doesn't just guess once; it dances in two steps to get it right.

Step 1: The Wild Guess. The robot looks at where people are walking and guesses where they will go next, ignoring the walls for a moment.
Step 2: The Reality Check. The robot looks at its "Foggy Window Map." It sees, "Oh, I predicted this person would walk straight into a wall!" So, it adds the wall into its calculation as a "player" in the game. It re-runs the prediction, thinking, "Okay, since there's a wall here, the person will probably turn left instead."

4. The "Grouping" Trick (Clustering)

When there are 50 people in a room, it's hard to track who is interacting with whom.

The Analogy: Imagine a crowded dance floor. Instead of trying to track everyone individually, STGN-IT uses a "Clustering" algorithm to group people who are close together or moving together. It's like saying, "Okay, that group of three is moving as a unit," which makes it much easier for the robot's brain to understand the flow of traffic.

5. The "Code" for Hiding

How does the robot know the difference between a person who is actually standing at the center of the room (0,0) and a person who is just hidden behind a pillar?

The Solution: STGN-IT uses a special "ID Card" (Encoding). When a person is hidden, the robot doesn't just put their coordinates at zero; it attaches a special tag that says, "I am here, but I am currently invisible." This prevents the robot from getting confused and thinking the person magically teleported to the center of the room.

Why Does This Matter?

The authors tested this on a dataset called STCrowd, which simulates a robot's view (where things get blocked easily).

Old Robots: When people got blocked, the robots stopped predicting them, leading to potential collisions.
STGN-IT: Even when people were partially hidden, STGN-IT kept predicting their path smoothly. It was better at avoiding walls and other people than any other robot brain tested.

In short: STGN-IT is a robot navigator that doesn't panic when it loses sight of someone. It keeps a mental note of "ghosts," checks the real-time map for walls, and uses a two-step thinking process to predict where people will go more reliably, making it safer for robots to walk among humans.

1. Problem Statement

Pedestrian trajectory prediction is critical for mobile robot navigation in human-robot coexistence environments. However, existing algorithms face three significant limitations:

Requirement for Complete Data: Most state-of-the-art (SOTA) algorithms require historical trajectories to be complete. If a pedestrian is occluded (unobservable) in even a single past frame, their trajectory is deemed "incomplete," and the algorithm fails to predict their future path.
Viewpoint Mismatch: Most algorithms are trained on top-down datasets (e.g., ETH, UCY) where occlusion is rare. However, mobile robots typically operate with egocentric views (cameras/LIDAR on the robot), where occlusion is frequent.
Safety Risks: Current approaches often use a "filtration mode," ignoring occluded pedestrians. This is unsafe for robots, as a hidden pedestrian could suddenly appear and cause a collision. A safer alternative is "pad mode" (predicting even with missing data by padding positions with zeros), but existing models suffer severe performance degradation in this mode because they misinterpret the zero-padding as the pedestrian moving to the origin $(0,0)$.
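
The difference between the two modes can be illustrated with a small sketch. It is only an illustration under assumed conventions: a missing observation is represented as None, and the helper names filtration_mode and pad_mode are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Track = List[Optional[Point]]  # None marks a frame where the pedestrian was occluded


def filtration_mode(tracks: Dict[str, Track]) -> Dict[str, List[Point]]:
    """Keep only pedestrians that were observed in every history frame."""
    return {
        pid: [p for p in track if p is not None]
        for pid, track in tracks.items()
        if all(p is not None for p in track)
    }


def pad_mode(tracks: Dict[str, Track]) -> Dict[str, List[Point]]:
    """Keep every pedestrian and replace missing frames with (0.0, 0.0)."""
    return {
        pid: [p if p is not None else (0.0, 0.0) for p in track]
        for pid, track in tracks.items()
    }


history = {
    "a": [(1.0, 2.0), (1.1, 2.1), (1.2, 2.2)],
    "b": [(4.0, 4.0), None, (4.2, 4.1)],  # occluded in the middle frame
}

kept = filtration_mode(history)  # pedestrian "b" is dropped entirely
padded = pad_mode(history)       # pedestrian "b" stays, with a zero pair in frame 1
```

In padded, the middle position of "b" cannot be told apart from a real observation at the origin, which is the ambiguity that degrades existing models in pad mode.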

2. Methodology: STGN-IT

The authors propose STGN-IT (Spatio-Temporal Graph Network for Incomplete Trajectories), a two-stage prediction framework designed to handle incomplete inputs and environmental obstacles.

A. Core Architecture

The network consists of four main modules and performs two sequential predictions:

Spatio-Temporal Graph Construction:
- Nodes represent pedestrians and obstacles; edges represent correlations.
- Nodes encode position ($X_t$) and velocity ($\Delta X_t$).
- DBSCAN Clustering: Used to reorder nodes in the adjacency matrix. This ensures that interacting agents (pedestrians and nearby obstacles) are placed adjacently in the matrix, facilitating feature extraction by Graph Convolutional Networks (GCNs).
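
The node reordering idea can be sketched as follows. This is a minimal, hand-written DBSCAN used only to keep the example dependency-free (in practice a library implementation such as scikit-learn's DBSCAN would normally be used). Clustering on positions alone, the eps and min_pts values, and placing noise points last are all assumptions of this sketch; the summary does not state the features or parameters the authors use.

```python
from math import dist
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
NOISE = -1  # label for points that belong to no cluster


def neighbours(points: Sequence[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Minimal DBSCAN returning one cluster label per point."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(points, j, eps)
            if len(reachable) >= min_pts:
                queue.extend(reachable)
    return [NOISE if label is None else label for label in labels]


def cluster_order(labels: Sequence[int]) -> List[int]:
    """Node indices sorted so that members of one cluster are contiguous."""
    # Assumption: noise points go last; ties keep their original order.
    return sorted(range(len(labels)), key=lambda i: (labels[i] == NOISE, labels[i]))


def reorder(adj: Sequence[Sequence[int]], order: Sequence[int]) -> List[List[int]]:
    """Apply the same permutation to the rows and columns of an adjacency matrix."""
    return [[adj[i][j] for j in order] for i in order]


positions = [(0.0, 0.0), (5.0, 5.0), (0.3, 0.1), (5.2, 5.1), (9.0, 0.0)]
labels = dbscan(positions, eps=1.0, min_pts=2)
order = cluster_order(labels)

# Toy adjacency: 1 when two nodes are within 1.0 of each other, else 0.
adjacency = [[int(dist(p, q) <= 1.0) for q in positions] for p in positions]
reordered = reorder(adjacency, order)
```

After reordering, the toy adjacency matrix becomes block-diagonal, so each group of interacting nodes occupies a contiguous block.
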
Observation State Encoding:
- To prevent the network from misinterpreting occluded positions (padded with $[0,0]$) as actual movement to the origin, a specific encoding scheme is introduced.
- Two vectors, $No^i_t$ (node observation state) and $Eo^i_t$ (edge observation state), are generated based on whether a node/edge is observable.
- These vectors are combined with node/edge features via fully connected layers and Hadamard products to create encoded features that distinguish between "missing data" and "actual position."
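
A rough sketch of how such an encoding can separate a padded position from a real one is given below. Everything specific here is an assumption made for illustration: the weights are hand-picked stand-ins for learned parameters, the edge rule (an edge is observable only if both endpoints are) is a guess, and the function names are hypothetical. Only the general pattern, fully connected layers followed by a Hadamard product, comes from the description above.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def node_observation_state(track: Sequence[Optional[Point]]) -> List[float]:
    """1.0 for frames where the node was observed, 0.0 where it was occluded."""
    return [0.0 if p is None else 1.0 for p in track]


def edge_observation_state(no_i: Sequence[float], no_j: Sequence[float]) -> List[float]:
    """Assumption: an edge counts as observable only if both endpoints are."""
    return [a * b for a, b in zip(no_i, no_j)]


def linear(x: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """A fully connected layer without activation: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def hadamard(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


# Hand-picked illustrative parameters (a trained network would learn these).
W_FEAT, B_FEAT = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.1, 0.1, 0.1]
W_STATE, B_STATE = [[1.0], [1.0], [-1.0]], [0.0, 0.0, 1.0]


def encode_node(position: Point, observed: float) -> List[float]:
    """Gate the embedded position with the embedded observation state."""
    features = linear(position, W_FEAT, B_FEAT)
    gate = linear([observed], W_STATE, B_STATE)
    return hadamard(features, gate)


track_i = [(0.0, 0.0), None, (0.2, 0.1)]
track_j = [(1.0, 1.0), (1.1, 1.0), None]
no_i = node_observation_state(track_i)
no_j = node_observation_state(track_j)
eo_ij = edge_observation_state(no_i, no_j)

real_origin = encode_node((0.0, 0.0), observed=1.0)   # truly standing at (0, 0)
padded_zero = encode_node((0.0, 0.0), observed=0.0)   # occluded, zero-padded
```

Although both inputs carry the coordinates (0, 0), real_origin and padded_zero come out as different vectors, which is the property the encoding is meant to provide.
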
Trajectory Prediction Module:
- Compensation: Two GRU networks compensate for missing position information by utilizing features from previous frames.
- Feature Extraction: A Spatio-Temporal Graph Convolution Network (STGCN) and a Time-Extrapolator Convolution Network (TECN) extract spatial and temporal features.
- Output: A Bi-GRU and Multi-Layer Perceptron (MLP) decode the features to output predicted displacements and final positions.
Obstacle Addition Module (Two-Stage Prediction):
- First Prediction: The network predicts trajectories using only pedestrian data.
- Obstacle Integration: Based on the first prediction and an Occupancy Grid Map (automatically generated from point cloud data), obstacles near the predicted paths are identified.
- Second Prediction: These obstacles are added as new nodes to the spatio-temporal graph. The network performs a second prediction, now accounting for static environmental constraints, significantly improving accuracy.
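
The control flow of the two-stage prediction can be sketched as below. The predictor here is a toy constant-velocity stand-in, not the STGN-IT network, and it ignores the obstacle nodes it receives; the grid layout, the cell size, the search radius, and all function names are assumptions introduced for illustration.

```python
from math import dist
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Path = List[Point]
Grid = List[List[int]]  # 1 = occupied cell, 0 = free cell
Predictor = Callable[[Sequence[Path], Sequence[Point]], List[Path]]


def occupied_cells(grid: Grid, cell_size: float) -> List[Point]:
    """Centres of occupied cells, with grid[row][col] mapped to (x=col, y=row)."""
    return [
        ((c + 0.5) * cell_size, (r + 0.5) * cell_size)
        for r, row in enumerate(grid)
        for c, occupied in enumerate(row)
        if occupied
    ]


def obstacles_near(paths: Sequence[Path], obstacles: Sequence[Point],
                   radius: float) -> List[Point]:
    """Obstacles lying within radius of any predicted position."""
    return [
        o for o in obstacles
        if any(dist(o, p) <= radius for path in paths for p in path)
    ]


def two_stage_predict(predict: Predictor, histories: Sequence[Path], grid: Grid,
                      cell_size: float, radius: float) -> Tuple[List[Path], List[Point]]:
    """First predict without obstacles, then again with nearby obstacles as nodes."""
    first = predict(histories, [])
    nearby = obstacles_near(first, occupied_cells(grid, cell_size), radius)
    second = predict(histories, nearby)
    return second, nearby


def constant_velocity(histories: Sequence[Path], obstacles: Sequence[Point],
                      steps: int = 3) -> List[Path]:
    """Toy stand-in for the network: repeats the last displacement, ignores obstacles."""
    paths = []
    for h in histories:
        (x0, y0), (x1, y1) = h[-2], h[-1]
        vx, vy = x1 - x0, y1 - y0
        paths.append([(x1 + vx * k, y1 + vy * k) for k in range(1, steps + 1)])
    return paths


grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],  # obstacle ahead of the pedestrian
    [1, 0, 0, 0],  # obstacle away from the predicted path
]
histories = [[(0.5, 1.5), (1.0, 1.5)]]
prediction, added_nodes = two_stage_predict(
    constant_velocity, histories, grid, cell_size=1.0, radius=1.2
)
```

Only the obstacle close to the first predicted path is added as a node; the distant occupied cell is left out, which keeps the graph small in the second pass.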

B. Key Distinctions

Input Flexibility: Unlike SOTA models, STGN-IT accepts trajectories where a pedestrian is observable in the latest frame and has been visible for at least 2 of the past 8 frames.
Environment Awareness: It utilizes occupancy grid maps rather than manually labeled semantic maps, making it more adaptable to real-world robot sensors.
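
The input rule can be written as a small check. This sketch assumes that the 8-frame window includes the latest frame and that visibility is given as one boolean per frame; the function names are hypothetical.

```python
from typing import Sequence


def accepted_by_stgn_it(observed: Sequence[bool], min_visible: int = 2) -> bool:
    """True if the latest frame is observed and at least min_visible frames are."""
    return bool(observed) and bool(observed[-1]) and sum(observed) >= min_visible


def accepted_with_complete_history(observed: Sequence[bool]) -> bool:
    """The stricter rule: every frame in the window must be observed."""
    return bool(observed) and all(observed)


# Eight history frames, oldest first; True means the pedestrian was visible.
late_arrival = [False, False, False, False, False, False, True, True]
just_occluded = [True, True, True, True, True, True, True, False]
single_glimpse = [False, False, False, False, False, False, False, True]
```

Here late_arrival is accepted under the relaxed rule but rejected by a model that needs a complete history, while just_occluded and single_glimpse are rejected by both.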

3. Key Contributions

STGN-IT Architecture: A novel spatio-temporal graph network combining Graph Convolutional Networks (GCN), GRUs, and a specialized encoding method to handle incomplete trajectory inputs without performance collapse.
Observation State Encoding: A mechanism to explicitly encode the "visibility" of pedestrians, preventing the network from confusing occluded positions with the origin $(0,0)$.
Two-Stage Obstacle Integration: A pipeline that dynamically adds static obstacles as graph nodes based on initial predictions and occupancy maps, refining the trajectory to avoid collisions.
Evaluation Paradigm Shift: The authors advocate for "pad mode" (training and testing with incomplete data) over "filtration mode" as a more realistic and safer metric for mobile robot navigation.

4. Experimental Results

The model was evaluated on STCrowd (STC), a dataset designed for egocentric views with 3D LIDAR data. A modified version, STC-c, was created by randomly removing 10% of samples to simulate severe occlusion.

Quantitative Performance:
- STGN-IT achieved the lowest Average Displacement Error (ADE) and Final Displacement Error (FDE) across all three evaluation conditions:
  - f-f (Filtration mode, complete data only).
  - p-p (Pad mode, incomplete data allowed).
  - STC-c, p-p (Pad mode with artificially increased missing data).
- Robustness: While SOTA algorithms (e.g., Social-STGCNN, SSAGCN) saw performance degradation of 25% to nearly 100% when moving from complete to incomplete data, STGN-IT's degradation was only ~15%.
- Comparison: In the "p-p" condition, STGN-IT outperformed the second-best algorithm (STIGCN) by a significant margin (e.g., ADE 0.35 vs. 0.46).
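
For reference, the two metrics can be computed as in the sketch below, which uses their standard definitions for a single trajectory (benchmarks additionally average over all pedestrians). The sample trajectories are made up for illustration, and the last line only restates the reported p-p ADE values as a relative difference.

```python
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average Displacement Error: mean Euclidean distance over all predicted steps."""
    return sum(dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def fde(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Final Displacement Error: Euclidean distance at the last predicted step."""
    return dist(pred[-1], truth[-1])


predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
actual = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

example_ade = ade(predicted, actual)  # (0 + 1 + 2) / 3 = 1.0
example_fde = fde(predicted, actual)  # 2.0

# Relative ADE reduction implied by the reported values 0.35 vs. 0.46 (roughly 0.24).
relative_gain = 1 - 0.35 / 0.46
```
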
Ablation Study:
- Removing the clustering process increased error by ~20%, confirming the importance of node ordering.
- Removing observation state encoding caused significant performance drops in pad mode, proving its necessity for handling occlusion.
- Removing obstacle nodes reduced accuracy, highlighting the value of the two-stage prediction.
Qualitative Analysis:
- Visual results show that STGN-IT successfully predicts trajectories that avoid static obstacles (unlike SOTA models which often predict collisions) and handles pedestrian interactions (e.g., bypassing stopped groups) more smoothly.
- In scenes with heavy occlusion, STGN-IT produced smooth, reasonable trajectories, whereas other models produced unstable or non-existent predictions.

5. Significance

This paper addresses a critical gap between academic trajectory prediction research and practical mobile robot deployment.

Real-World Applicability: By handling incomplete trajectories and utilizing occupancy grid maps, STGN-IT is directly applicable to robots operating in dynamic, occluded environments.
Safety: The shift from "filtration mode" to "pad mode" ensures that robots do not ignore occluded pedestrians, thereby reducing the risk of collisions.
Responsiveness: The model can begin predicting trajectories 1.2 seconds after a pedestrian is first observed (vs. 3.2 seconds for other models), allowing for faster reaction times in navigation planning.

In conclusion, STGN-IT represents a robust advancement in pedestrian trajectory prediction, specifically tailored for the constraints and safety requirements of mobile robotics in human-populated environments.