Entropy-Driven Curriculum for Multi-Task Training in Human Mobility Prediction

Imagine you are trying to teach a robot how to predict where a person will go next in a busy city. You have millions of GPS tracks from thousands of people. Some people have very boring, predictable routines (like a commuter who goes Home → Work → Gym → Home every day). Others have chaotic, unpredictable lives (like a tourist visiting random spots, a delivery driver with erratic routes, or someone just wandering around).

The problem with standard AI training is that it throws all these different people's data into a giant blender and shuffles it randomly. It's like trying to teach a child to read by handing them a dictionary, a physics textbook, and a comic book all at once, in random order. The child gets overwhelmed, confused, and learns slowly.

This paper proposes a smarter way to train the AI, using two main ideas: a "Curriculum" (a lesson plan) and a "Multi-Task" approach (learning related skills at once).

Here is the breakdown in simple terms:

1. The "Curriculum": Learning from Easy to Hard

Instead of random shuffling, the authors created a Curriculum Learning system. Think of this like a video game. You don't start by fighting the final boss; you start with the tutorial level, then easy enemies, and slowly work your way up to the hard stuff.

How do they know what is "easy"?
They use a mathematical concept called Entropy (which basically measures "chaos" or "surprise").
- Low Entropy (Easy): A person who goes to the same coffee shop every morning at 8 AM. This is very predictable.
- High Entropy (Hard): A person who visits 20 different places in a random order. This is chaotic and hard to guess.
The Strategy: They calculate the "chaos score" for every person's history. They start training the AI only on the "Low Chaos" (predictable) people. Once the AI gets good at those, they slowly introduce the "Medium Chaos" people, and finally, the "High Chaos" ones.
The Result: The AI builds a strong foundation before it gets overwhelmed. The paper says this made the AI learn almost 3 times faster than the old random method.
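
To make the "chaos score" idea concrete, here is a minimal sketch that uses plain Shannon entropy of visit frequencies as a simpler stand-in. It is not the paper's estimator, which is based on Lempel-Ziv compression and also accounts for the order of visits. The function name and the two example people are hypothetical.

```python
import math
from collections import Counter

def visit_entropy(places):
    """Shannon entropy (in bits) of how often each place is visited."""
    counts = Counter(places)
    total = len(places)
    # Sum p * log2(1/p) over the observed places.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical examples: a commuter with a fixed routine and a tourist
# who never visits the same spot twice.
commuter = ["home", "work", "gym", "home"] * 5
tourist = [f"spot_{i}" for i in range(20)]

print(visit_entropy(commuter))  # low score: few places, visited repeatedly
print(visit_entropy(tourist))   # high score: every visit is a new place
```

Under this simplified score the commuter would be scheduled early in the lesson plan and the tourist late.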

2. The "Multi-Task" Approach: Learning the Whole Picture

Usually, AI models are told: "Just guess the next location."
The authors realized that to guess where someone is going, you also need to know how far they are going and which way they are heading.

The Analogy: Imagine you are trying to guess where your friend is going.
- Old Way: You just guess the destination (e.g., "The Mall").
- New Way: You guess the destination, AND you also guess, "They are walking 500 meters North."
Why it helps: Even if the AI isn't 100% sure about the exact building, knowing the direction and distance helps it narrow down the possibilities. It's like having three clues instead of one. These extra clues act as "training wheels" that keep the AI on the right track, making the final location guess much more accurate.

3. The "MoBERT" Model

The AI architecture they built is called MoBERT.

Think of it as a super-smart reader (based on a famous AI called BERT) that looks at a person's entire history of movements at once, rather than reading it one step at a time.
It looks at the Time (is it morning or night?), the Place (is it a park or a hospital?), and the Movement (how far and in what direction?).
It combines all these clues to make a prediction.

4. The Results: Beating the Competition

The authors tested this on a massive dataset of 100,000 people in Japan (the YJMob100K dataset).

Speed: The AI learned up to 2.92 times faster thanks to the "Curriculum" method.
Accuracy: It achieved the best results ever recorded on this specific test (beating the winners of the 2023 HuMob Challenge).
The "Zero-Shot" Superpower: The most impressive part? They trained the AI on data from one city only. Then, they tested it on three completely different cities it had never seen before.
- Usually, AI trained in Tokyo fails miserably in New York.
- But because this AI learned the fundamental patterns of human movement (thanks to the curriculum and multi-tasking), it worked almost as well in the new cities as models that were specifically trained on those cities. It's like learning to drive a car in one city and being able to drive well in a totally different city without needing a new lesson.

Summary

The paper is about teaching an AI to predict human movement by:

Sorting the data from "boring/predictable" to "chaotic/random" so the AI learns step-by-step.
Asking the AI to solve three puzzles at once (Where? How far? Which way?) to help it understand the context better.
Result: A faster, smarter AI that can predict where people will go, even in cities it has never visited before.

Here is a detailed technical summary of the paper "Entropy-Driven Curriculum for Multi-Task Training in Human Mobility Prediction."

1. Problem Statement

Human mobility prediction aims to forecast future locations based on historical trajectory data. Despite the availability of big data and deep learning, the field faces two primary challenges:

Heterogeneous Data Complexity: Human mobility data varies significantly in complexity. Simple, repetitive routines (e.g., daily commutes) are easy to model, while irregular, sparse, or exploratory trajectories are difficult. Standard training methods (random shuffling) treat all data as equally difficult, leading to inefficient gradient updates, training instability, and suboptimal convergence.
Limited Supervision Signals: Most existing approaches focus solely on next-location prediction (a single-task objective). This neglects inherent mobility characteristics such as movement distance and direction, which provide valuable complementary information for learning realistic mobility patterns. Furthermore, many Multi-Task Learning (MTL) approaches rely on auxiliary tasks (e.g., activity type, transport mode) that require specific annotations not universally available in datasets.

2. Methodology

The authors propose a unified training framework integrating Entropy-Driven Curriculum Learning and Multi-Task Learning (MTL), built upon a custom Transformer architecture called MoBERT.

A. Entropy-Driven Curriculum Learning

Instead of random sampling, the training data is organized from simple to complex based on a quantitative measure of trajectory predictability.

Theoretical Basis: The approach relies on Fano's Inequality, which ties the best achievable prediction error to entropy: the lower a trajectory's entropy, the smaller the lower bound on its prediction error, so highly regular trajectories are fundamentally more learnable.
Complexity Metric: The authors propose a Normalized Lempel-Ziv (LZ) Mobility Entropy estimator ($H_{norm-LZ}$).
- Trajectories are symbolized: each 2D coordinate is flattened into a single 1D symbol.
- LZ compression is applied to count unique subsequences.
- Entropy is calculated based on the average phrase length and normalized to a range of $[0, 1]$.
- Interpretation: Values near 0 indicate highly predictable routines; values near 1 indicate random/irregular movement.
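
The estimator steps above can be sketched as follows. This is a minimal illustration, not the paper's exact formula: it assumes the widely used LZ-style estimator $n \log_2 n / \sum_i \Lambda_i$, where $\Lambda_i$ is the length of the shortest substring starting at position $i$ that has not appeared earlier, and it assumes normalization by $\log_2$ of the number of distinct symbols. The function names and the `grid_width` parameter are hypothetical.

```python
import math

def flatten_cell(x, y, grid_width):
    """Map a 2D grid cell (x, y) to one integer symbol, row by row."""
    return y * grid_width + x

def _occurs(sub, history):
    """True if the list `sub` appears as a contiguous run inside `history`."""
    k = len(sub)
    return any(history[j:j + k] == sub for j in range(len(history) - k + 1))

def normalized_lz_entropy(symbols):
    """LZ-style entropy estimate of a symbol sequence, scaled to [0, 1].

    Assumed normalization: divide by log2(number of distinct symbols).
    """
    n = len(symbols)
    distinct = len(set(symbols))
    if n < 2 or distinct < 2:
        return 0.0  # a single repeated place is perfectly predictable
    total = 0
    for i in range(n):
        # Grow the substring starting at i until it is new with respect
        # to everything seen before position i (or the sequence ends).
        k = 1
        while i + k <= n and _occurs(symbols[i:i + k], symbols[:i]):
            k += 1
        total += k
    entropy = n * math.log2(n) / total  # bits per symbol
    return min(1.0, entropy / math.log2(distinct))

# Hypothetical trajectories: a three-stop loop versus thirty distinct cells.
routine = [flatten_cell(x, 0, grid_width=200) for x in (0, 1, 2)] * 10
irregular = list(range(30))

print(normalized_lz_entropy(routine))    # closer to 0
print(normalized_lz_entropy(irregular))  # close to 1
```

The substring search here is cubic in the worst case, which is fine for a sketch but would need a faster matching scheme for datasets of this size.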
Curriculum Pipeline:
1. Augmentation: Real trajectories are augmented via mirroring and rotation to increase data volume while preserving the order and regularity of visits.
2. Sorting & Staging: Augmented trajectories are sorted by increasing $H_{norm-LZ}$. Training proceeds in stages with increasing entropy and increasing prediction horizons (e.g., predicting 3 days $\to$ 7 days $\to$ 15 days).
3. Fine-tuning: The model is pre-trained on the curriculum and then fine-tuned exclusively on real (non-augmented) trajectories.
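
The pipeline above might be organized roughly as in the sketch below. Several details are assumptions made for illustration: trajectories are lists of (x, y) cells on a square grid, each stage receives an equal-sized, non-overlapping slice of the sorted data, and the function names are hypothetical. The paper's actual stage boundaries and augmentation set may differ.

```python
import math

def augment(trajectory, grid_size):
    """Return the original plus a mirrored and a 90-degree-rotated copy.

    Points are (x, y) cells on a square grid with side `grid_size`.
    """
    mirrored = [(grid_size - 1 - x, y) for x, y in trajectory]
    rotated = [(grid_size - 1 - y, x) for x, y in trajectory]
    return [trajectory, mirrored, rotated]

def build_stages(samples, horizons=(3, 7, 15)):
    """Split (entropy, trajectory) pairs into easy-to-hard training stages.

    Each stage pairs one slice of the entropy-sorted data with one
    prediction horizon (in days), shortest horizon first.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    size = math.ceil(len(ordered) / len(horizons))
    stages = []
    for i, horizon in enumerate(horizons):
        chunk = ordered[i * size:(i + 1) * size]
        stages.append({
            "horizon_days": horizon,
            "trajectories": [traj for _, traj in chunk],
        })
    return stages

# Hypothetical entropy scores for three placeholder trajectories.
samples = [(0.9, "tourist"), (0.1, "commuter"), (0.5, "student")]
for stage in build_stages(samples):
    print(stage["horizon_days"], stage["trajectories"])
# After the last stage, fine-tuning would use only the real,
# non-augmented trajectories.
```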

B. Multi-Task Learning (MTL)

To improve generalization without requiring new annotations, the framework introduces two universally available auxiliary tasks:

Distance Estimation: Predicting the Euclidean distance between consecutive points (discretized into classes: stationary, short, medium, long).
Direction Estimation: Predicting the movement direction (discretized into 9 classes: N, NE, E, etc., and stationary).
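
Both auxiliary labels can be derived from consecutive points alone, which is why no extra annotation is needed. The sketch below shows one way to do it. The distance thresholds (`short_max`, `medium_max`) and the convention that y increases northward are assumptions, not values taken from the paper.

```python
import math

# Eight compass sectors, counter-clockwise starting from east.
DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]

def distance_class(p, q, short_max=2.0, medium_max=10.0):
    """Bucket the Euclidean distance from p to q into four classes.

    The thresholds are hypothetical and expressed in grid units.
    """
    d = math.hypot(q[0] - p[0], q[1] - p[1])
    if d == 0:
        return "stationary"
    if d <= short_max:
        return "short"
    if d <= medium_max:
        return "medium"
    return "long"

def direction_class(p, q):
    """Bucket the heading from p to q into 8 compass sectors or stationary."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if dx == 0 and dy == 0:
        return "stationary"
    angle = math.degrees(math.atan2(dy, dx)) % 360  # 0 = east, 90 = north
    # Shift by half a sector so each 45-degree bin is centred on its label.
    return DIRECTIONS[int((angle + 22.5) // 45) % 8]

print(distance_class((0, 0), (3, 4)), direction_class((0, 0), (3, 4)))
```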

Loss Function: The total loss is a weighted sum: $L = L_{loc} + \lambda_1 L_{dist} + \lambda_2 L_{dir}$. These tasks provide spatial constraints (bounds and orientation) that complement the primary location prediction.
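
As a worked illustration of the weighted sum, the sketch below computes it for a single training example. It assumes each task uses a softmax cross-entropy loss, which is a natural fit for discretized classes but is not confirmed by this summary, and the weights `lambda_dist` and `lambda_dir` are placeholder values.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def multitask_loss(loc, dist, direction, lambda_dist=0.5, lambda_dir=0.5):
    """L = L_loc + lambda_1 * L_dist + lambda_2 * L_dir.

    Each argument is a (logits, target_index) pair for one task.
    The default weights are hypothetical.
    """
    l_loc = cross_entropy(*loc)
    l_dist = cross_entropy(*dist)
    l_dir = cross_entropy(*direction)
    return l_loc + lambda_dist * l_dist + lambda_dir * l_dir

# Toy example: 4 candidate locations, 4 distance classes, 9 direction classes.
loss = multitask_loss(
    loc=([2.0, 0.5, 0.1, 0.1], 0),
    dist=([0.0, 1.5, 0.2, 0.0], 1),
    direction=([0.0] * 9, 3),
)
print(loss)
```

Setting both weights to zero recovers the single-task location objective, which makes the auxiliary contribution easy to ablate.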

C. Model Architecture: MoBERT

Base: An encoder-only Transformer based on BERT.
Input: 3D tensor $[Batch, Sequence, Embedding]$ containing spatial coordinates, timestamps, and semantic features (Day of week, time intervals, top POI categories).
Feature Interaction: A Multi-Head Self-Attention (MHSA) module fuses the 8 input features, capturing dynamic inter-dependencies (e.g., prioritizing spatiotemporal features during commutes).
Heads: The model shares the encoder but uses separate Feed-Forward Networks (FFNs) for the three tasks (Location, Distance, Direction).

3. Key Contributions

Entropy-Driven Curriculum Strategy: A principled method to quantify trajectory complexity using information theory (LZ compression) and organize training from simple to complex, addressing the heterogeneity of mobility data.
Universal Multi-Task Framework: Introduction of distance and direction prediction as auxiliary tasks. Unlike previous MTL approaches, these require no additional data annotations and are applicable to any mobility dataset.
MoBERT Architecture: A specialized BERT-like model for mobility that integrates multi-feature embeddings and attention-based feature interaction to capture complex spatiotemporal dependencies.
State-of-the-Art Performance: The framework achieves top results on the HuMob Challenge benchmarks, outperforming previous winners and demonstrating superior zero-shot generalization to unseen cities.

4. Experimental Results

The method was evaluated on the YJMob100K dataset (100k users, 75 days of GPS data) using the HuMob Challenge metrics: GEO-BLEU (spatial coverage similarity) and DTW (spatiotemporal alignment).
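
For readers unfamiliar with DTW, the sketch below is the textbook dynamic-programming version with Euclidean point distance. It is not the challenge's official scoring code, which may differ in how it normalizes or segments trajectories. Lower values mean the predicted path stays closer to the true one.

```python
import math

def dtw_distance(a, b):
    """Textbook dynamic time warping cost between two point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Hypothetical predicted and true trajectories on a grid.
predicted = [(0, 0), (1, 0), (2, 1)]
actual = [(0, 0), (1, 1), (2, 1)]
print(dtw_distance(predicted, actual))
```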

Performance Metrics:
- GEO-BLEU: 0.354 (State-of-the-Art; previous best was 0.344).
- DTW: 26.15 (State-of-the-Art; previous best was 26.22).
Convergence Speed: The entropy-driven curriculum accelerated training convergence by up to 2.92x compared to random sampling to reach target loss levels.
Ablation Studies:
- Adding semantics improved performance.
- Feature interaction modules provided significant gains.
- MTL (Distance + Direction) contributed most to GEO-BLEU improvements.
- Curriculum learning was most effective for reducing DTW (temporal/spatial alignment).
Cross-City Generalization (Zero-Shot):
- The model trained only on City A was tested on Cities B, C, and D without fine-tuning.
- It outperformed LP-BERT (trained on all cities) and matched the performance of the much larger Llama-3-8B-Mob on unseen cities, despite being about 1/6th the size of that LLM.
- This suggests that architectural design and curriculum strategy are more critical for transferability than simply aggregating massive multi-city datasets.

5. Significance

This paper addresses the fundamental inefficiency in training mobility models by aligning the learning process with the intrinsic complexity of the data. By leveraging information theory to structure the curriculum and utilizing universally available geometric constraints (distance/direction) for multi-task learning, the authors demonstrate that:

Efficiency: Models can converge significantly faster and more stably.
Generalizability: Compact, well-structured models can generalize to unseen urban environments as well as, or better than, larger models trained on diverse but potentially conflicting data.
Practicality: The approach requires no new data collection or annotation, making it highly scalable for real-world deployment in urban planning, transportation optimization, and epidemic modeling.