VisiFold: Long-Term Traffic Forecasting via Temporal Folding Graph and Node Visibility

Imagine you are trying to predict the traffic in a massive city 24, 36, or even 48 time steps into the future. This is an incredibly hard job for computers.

Usually, traffic forecasting models work like a stack of photo albums. To predict the future, they look at a photo of the city from 1 minute ago, then another from 2 minutes ago, and so on. They try to stitch these photos together to guess what happens next.

The Problem:
The further ahead you want to predict, the taller the stack of photos gets. Forecasting 48 steps ahead means working through a far taller stack than forecasting just a few.

The "Snapshot Stacking" Problem: The computer gets overwhelmed trying to hold all these photos in its memory. It's like trying to carry a stack of 1,000 books; eventually, you drop them.
The "Fragmentation" Problem: Because the computer looks at each photo separately, it struggles to connect the dots between "1 minute ago" and "48 hours from now." The story gets broken up.

Enter VisiFold, a new method that solves this by changing how the computer looks at the data.

The Big Idea: The "Time-Folding" Trick

Instead of stacking photos on top of each other, VisiFold uses a technique called Temporal Folding.

The Analogy: The Time-Traveling Backpack
Imagine every traffic sensor (a camera on a pole) has a magical backpack.

Old Way: The computer looks at the sensor's backpack, takes out a photo from 1 minute ago, then puts it back, then takes out a photo from 2 minutes ago. It does this for every single sensor, one by one.
VisiFold Way: The computer reaches into the backpack and pulls out all the photos from the last hour at once. It folds them all together into a single, super-dense "Time-Sandwich."

Now, instead of a tall stack of city-wide photos (one photo of all 1,000 sensors for every moment in time), the computer has a single picture of 1,000 sensors, each holding a "Time-Sandwich" containing its whole history. The computer only needs to look at the sensors once to understand the whole timeline. This saves a massive amount of memory and processing power.

The Second Trick: "Node Visibility" (The Blindfold Game)

Even with the Time-Sandwich, looking at 1,000 sensors at once is still too much for a computer to handle efficiently.

The Analogy: The Classroom Group Project
Imagine a classroom of 1,000 students (sensors) trying to solve a puzzle.

The Old Way: Every student talks to every other student simultaneously. The room becomes chaotic, loud, and slow.
VisiFold's Way: The teacher (the computer) puts blindfolds on 80% of the students and tells the remaining 20% to form small groups of 10 to talk.
- Node-Level Masking: By "hiding" 80% of the sensors, the computer doesn't waste energy processing them. Surprisingly, the model learns better this way because it's forced to focus on the most important patterns rather than getting distracted by noise.
- Subgraph Sampling: The small groups work in parallel. While Group A solves their part, Group B solves theirs. This is much faster than everyone talking at once.

Why is this a Big Deal?

It's Lightning Fast: According to the reported results, VisiFold trains up to 7.8 times faster and uses about 5 times less GPU memory than the most efficient baseline (STID), and the gap over the previous best model (STAEformer) is larger still.
It's Smarter at Long Terms: Because it doesn't get overwhelmed by the "stack of photos," it can predict traffic much further into the future (up to 48 time steps) with high accuracy.
It Doesn't Need a Map: Traditional models rely heavily on knowing exactly which roads connect to which. VisiFold is so good at finding patterns that it can ignore the strict road map and still figure out that "Sensor A" behaves like "Sensor B," even if they are far apart.

The Bottom Line

VisiFold is like upgrading from a librarian who has to walk to every shelf to find a book, to a librarian who can magically pull all the books off the shelves, fold them into a single compact guide, and hand it to you instantly.

It allows cities to plan for the future (like avoiding traffic jams before they happen) without needing supercomputers that cost millions of dollars. It's a smarter, leaner, and faster way to see the future of our roads.

1. Problem Statement

Long-term traffic forecasting (predicting traffic states far into the future, e.g., 24–48 steps ahead) is a critical component of intelligent transportation systems. However, existing methods face two fundamental bottlenecks when extending the prediction horizon, and large road networks compound both:

Snapshot-Stacking Inflation: Traditional Spatial-Temporal Graph Neural Networks (STGNNs) and Transformer-based models treat time as a sequence of discrete snapshots. As the prediction horizon ($T$) increases, the computational cost and memory consumption grow linearly or quadratically with $T$ due to the need to process a sequence of graphs.
Cross-Step Fragmentation: Current models typically decouple spatial and temporal modeling. They aggregate spatial information within a snapshot and propagate it across time steps via sequential modules. This leads to "fragmentation" of temporal dependencies, where information must pass through multiple intermediate representations, degrading accuracy over long horizons.
Scalability Issues: Large-scale road networks with thousands of nodes exacerbate these issues, making long-term forecasting computationally prohibitive.
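To get a feel for how quickly snapshot stacking inflates, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that both the temporal and the spatial module are attention-based with quadratic cost, so a stacked model pays one $T^2$ temporal pass per node plus one $N^2$ spatial pass per snapshot. The function names and the cost model are assumptions, not figures from the paper.

```python
def stacked_cost(num_nodes: int, num_steps: int) -> int:
    """Illustrative cost of snapshot stacking under an assumed quadratic-attention model:
    one temporal attention per node (T^2 each) plus one spatial attention per snapshot (N^2 each)."""
    return num_nodes * num_steps**2 + num_steps * num_nodes**2


def single_graph_cost(num_nodes: int) -> int:
    """Illustrative cost of a single spatial attention pass over one graph (N^2)."""
    return num_nodes**2


N = 1000  # hypothetical number of sensors
for T in (12, 24, 48):
    ratio = stacked_cost(N, T) / single_graph_cost(N)
    print(f"T={T:>2}: stacked / single-graph cost = {ratio:.2f}")
```

Under this simplified model the ratio is $T + T^2/N$, so it grows at least linearly with the number of steps. Real models differ in constants and in which modules they use, so treat the numbers as a trend rather than a measurement.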

2. Methodology: VisiFold

The authors propose VisiFold, a framework that rethinks the input representation and node interaction mechanisms to break temporal and spatial constraints.

A. Temporal Folding Graph (TFG)

Instead of treating time as a sequence of separate graph snapshots, VisiFold introduces a Temporal Folding Graph.

Concept: It "folds" the temporal dimension into the node attributes. For a specific node, the traffic signals across a sequence of $T$ time steps are compressed into a single, enriched attribute vector (a "TF-token").
Mechanism:
- Input: $X_{t-T+1:t} \in \mathbb{R}^{N \times T \times C}$ (Nodes $\times$ Time $\times$ Channels).
- Transformation: The channel dimension is squeezed, and the temporal sequence for each node is embedded into a single vector.
- Result: The model processes a single graph where each node contains the entire temporal history, rather than a sequence of graphs.
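Benefit: This eliminates the need for cross-step message passing and temporal modules (like RNNs or sequential attention), reducing time and space complexity from $O(N \cdot g(T) + T \cdot h(N))$ to $O(h(N))$, where $g(T)$ reads as the cost of temporal modeling for one node over $T$ steps and $h(N)$ as the cost of spatial modeling for one graph of $N$ nodes.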
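The folding step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `fold_temporal`, the random weights, and the sizes are hypothetical, and flattening time and channels together is an assumption that reduces to the "squeeze" described above when $C = 1$.

```python
import numpy as np


def fold_temporal(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Fold an (N, T, C) history into (N, d) tokens, one token per node.

    Each node's T x C history is flattened into one vector and linearly projected.
    """
    num_nodes, num_steps, num_channels = x.shape
    folded = x.reshape(num_nodes, num_steps * num_channels)  # one row per node
    return folded @ weight + bias


rng = np.random.default_rng(0)
N, T, C, d = 5, 12, 1, 16                # hypothetical sizes
x = rng.normal(size=(N, T, C))           # stand-in for recorded traffic signals
W = rng.normal(size=(T * C, d)) * 0.1    # stand-in for a learned projection
b = np.zeros(d)

tokens = fold_temporal(x, W, b)
print(tokens.shape)  # (5, 16)
```

The output has one row per node regardless of how long the history is, which is why everything downstream sees a single graph of $N$ tokens instead of $T$ graphs.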

B. Node Visibility

Even with TFG, large node counts ($N$) in city-scale networks remain a bottleneck. VisiFold introduces Node Visibility to manage this, consisting of two techniques applied during training:

Node-Level Masking: A random subset of nodes (controlled by a mask ratio $r$) is completely removed from the encoder's view. This reduces the input size and prevents the model from relying on overly tight, local dependencies (shortcuts).
Subgraph Sampling: The remaining nodes are partitioned into smaller, randomly sampled subgraphs. This allows the Transformer encoder to process smaller batches in parallel, further reducing memory usage and increasing parallelism.

Inference: These masking and sampling steps are not applied during inference; the full graph is used to ensure complete information utilization.
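A minimal sketch of the two visibility steps, using only the Python standard library. The function name `visible_subgraphs`, the fixed group size, and the choice to partition a shuffled list are assumptions made for illustration; the paper's exact sampling procedure may differ.

```python
import random


def visible_subgraphs(num_nodes, mask_ratio, subgraph_size, training, rng):
    """Return lists of node indices; each list is processed together by the encoder."""
    nodes = list(range(num_nodes))
    if not training:
        return [nodes]  # inference: full graph, no masking or sampling

    # Node-level masking: keep a random (1 - mask_ratio) share of the nodes.
    num_visible = max(1, round(num_nodes * (1.0 - mask_ratio)))
    visible = rng.sample(nodes, num_visible)  # sampled without replacement, in random order

    # Subgraph sampling: split the shuffled visible nodes into disjoint groups.
    return [visible[i:i + subgraph_size] for i in range(0, num_visible, subgraph_size)]


rng = random.Random(0)
groups = visible_subgraphs(num_nodes=1000, mask_ratio=0.8, subgraph_size=10,
                           training=True, rng=rng)
print(len(groups), len(groups[0]))  # 20 10

full = visible_subgraphs(num_nodes=1000, mask_ratio=0.8, subgraph_size=10,
                         training=False, rng=rng)
print(len(full), len(full[0]))  # 1 1000
```

With a mask ratio of 0.8, only 200 of the 1,000 nodes reach the encoder in a training step, and attention runs within groups of 10 rather than across all of them.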

C. Architecture & Embeddings

Embedding Fusion: The model concatenates four types of embeddings:
1. Token Embedding ($E_x$): Linear projection of the folded temporal attributes.
2. Spatial Embedding ($E_s$): Learnable vectors identifying each node.
3. Time-of-Day Embedding ($E_{tod}$): Cyclic temporal embedding shared across all nodes.
4. Day-of-Week Embedding ($E_{dow}$): Cyclic temporal embedding shared across all nodes.
Backbone: A standard Transformer Encoder (Multi-Head Self-Attention + Feed-Forward Network) processes the fused embeddings.
Head: An MLP predicts the future traffic states.
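The pipeline above (fuse embeddings, encode, predict) can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: all sizes and random weights are hypothetical, the 288-slot time-of-day table assumes five-minute sampling, the encoder is reduced to one single-head self-attention layer with a residual connection (no layer norm or feed-forward block), and a single linear map stands in for the MLP head.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, H, d = 6, 12, 24, 8          # nodes, input steps, horizon, width per embedding (hypothetical)

x = rng.normal(size=(N, T))        # folded history, one row per node
tod_index, dow_index = 100, 3      # time-of-day slot and weekday of the input window (hypothetical)

W_x = rng.normal(size=(T, d)) * 0.1            # token projection
spatial_table = rng.normal(size=(N, d)) * 0.1  # one learnable vector per node
tod_table = rng.normal(size=(288, d)) * 0.1    # assumes 288 five-minute slots per day
dow_table = rng.normal(size=(7, d)) * 0.1

E_x = x @ W_x
E_s = spatial_table
E_tod = np.tile(tod_table[tod_index], (N, 1))  # shared across all nodes
E_dow = np.tile(dow_table[dow_index], (N, 1))  # shared across all nodes
Z = np.concatenate([E_x, E_s, E_tod, E_dow], axis=1)  # fused tokens, shape (N, 4d)


def self_attention(z, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the node tokens."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v


D = Z.shape[1]
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
hidden = Z + self_attention(Z, w_q, w_k, w_v)  # residual connection
W_out = rng.normal(size=(D, H)) * 0.1
forecast = hidden @ W_out                      # linear stand-in for the MLP head

print(Z.shape, forecast.shape)  # (6, 32) (6, 24)
```

Every node attends to every other node in one pass, and the head emits all $H$ future steps at once, so no step-by-step temporal module appears anywhere in the sketch.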

3. Key Contributions

Revisiting Representation: Identified that the standard "snapshot-stacking" paradigm inherently limits long-horizon forecasting due to decoupling and resource inflation.
Temporal Folding Graph (TFG): Proposed a novel tokenization strategy that compresses temporal dynamics into node attributes, enabling synchronized spatial-temporal modeling within a single graph.
Node Visibility Mechanism: Introduced node-level masking and subgraph sampling to handle large-scale graphs, acting as an implicit regularizer that improves robustness and reduces resource consumption.
Performance & Efficiency: Demonstrated that VisiFold achieves State-of-the-Art (SOTA) accuracy while drastically reducing computational costs.

4. Experimental Results

The model was evaluated on three real-world datasets: PEMS04, PEMS08, and SEATTLE, with prediction horizons of 24, 36, and 48 time steps.

Accuracy: VisiFold was compared against 12 strong baselines (including STGNNs, Transformers, and MLP-based methods like STID) across all datasets and horizons, and achieved the lowest RMSE, MAE, and MAPE in the majority of scenarios.
Resource Efficiency:
- Training Speed: Up to 17.8x faster than the previous best model (STAEformer) and 7.8x faster than the most efficient baseline (STID).
- Inference Speed: Up to 18.5x faster than STAEformer.
- Memory Usage: Reduced GPU memory consumption by 15.7x compared to STAEformer and 5.1x compared to STID.
Robustness to Masking: Remarkably, the model maintained or even improved performance when up to 80% of nodes were masked during training, indicating high redundancy in traffic data and the effectiveness of the regularization.

5. Significance and Impact

Breaking the Horizon Barrier: VisiFold effectively removes the computational constraints that previously limited traffic forecasting to short horizons, making reliable long-term prediction feasible.
Scalability: By decoupling the prediction horizon from the computational cost, the framework is scalable to massive urban networks without requiring prohibitive hardware resources.
Paradigm Shift: The paper challenges the necessity of explicit topological priors (adjacency matrices) and sequential temporal modules, showing that learning adjacency-insensitive representations via visibility mechanisms can yield better generalization and stability.
Practical Deployment: With inference times under one second and low memory footprints, VisiFold is suitable for real-time applications and edge deployment in intelligent transportation systems.

In conclusion, VisiFold represents a significant advancement in spatial-temporal forecasting by fundamentally altering how time and space are represented in the input, offering a highly efficient and accurate solution for long-term traffic prediction.

