Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

Imagine you are trying to understand a new city. You have two ways to learn about a specific place, like a coffee shop:

The Brochure Method (Text): You read the sign on the door. It says "Coffee Shop," "Open 7 AM," and "Located near the park." This tells you what the place is supposed to be.
The Detective Method (Movement): You stand outside for a week and watch who comes in, when they come, how long they stay, and what they do. You notice that on Tuesday mornings, it's packed with people rushing in for a quick espresso, but on Friday nights, it's empty. This tells you how the place is actually used.

Most computer systems for mapping cities only use the Brochure Method. They read the text and assume that's the whole story. But as the paper explains, this misses the real personality of a place. Two coffee shops might have the exact same sign, but one is a "grab-and-go" spot for commuters, while the other is a "work-from-home" hub where people sit for hours.

The Problem:
Existing AI models are great at reading the brochure but terrible at understanding the behavior. They can tell you a place is a "gym," but they don't know if it's a 24/7 fitness center or a yoga studio that only opens on weekends. They also struggle with "ghost places" (locations that have a sign but are actually closed) and with new places that haven't been written about yet.

The Solution: ME-POIs (The "Mobility-Embedded" Detective)
The authors created a new system called ME-POIs. Think of it as a super-smart detective that combines the brochure with the detective work.

Here is how it works, using a simple analogy:

1. The "Visit Diary" (The Encoder)

Imagine every time someone visits a place, they write a tiny diary entry: "I arrived at 8:00 AM, stayed for 15 minutes, and left."
The system reads millions of these diary entries. It doesn't just look at the place; it looks at the story of the visits. It uses a special "Transformer" brain (the same kind of AI that powers chatbots) to understand the patterns in these stories.

2. The "Group Hug" (Contrastive Learning)

Now, imagine the AI has a giant whiteboard. It wants to create one perfect "ID card" for the coffee shop.

It takes all the diary entries for that coffee shop and tries to squeeze them into one single ID card.
It makes sure this ID card looks different from the ID cards of the other places it is compared against.
The Magic: By forcing the AI to agree on what all these different visits have in common, it learns the true function of the place. It realizes, "Ah, this place is always busy at 8 AM and empty at 2 PM. That's its personality."

3. The "Neighborhood Watch" (Solving the Sparse Problem)

Here is the tricky part: What if a place is very new or very quiet? Maybe only 5 people have visited it. The AI doesn't have enough diary entries to understand it. This is called the "Long Tail" problem (the long list of unpopular or new items).

The paper introduces a clever trick called Multi-Scale Distribution Transfer.

The Analogy: Imagine a quiet, new bakery on a street. It has had only a handful of customers so far. But right next door is a famous, busy bakery.
The AI looks at the famous bakery and says, "Okay, this street is a 'breakfast street.' People come here at 8 AM."
It then "borrows" this pattern and applies it to the quiet bakery. It says, "Even though we haven't seen many people at the quiet bakery yet, because it's next to the busy one, it probably follows similar rules."
This allows the AI to make smart guesses about places it has never really seen before, just by looking at their neighbors.

4. The "Hybrid Brain" (Text + Movement)

Finally, the system takes the "Brochure" (the text description) and the "Detective Work" (the movement data) and smashes them together.

The text tells it: "This is a coffee shop."
The movement tells it: "This coffee shop is a high-speed commuter stop."
The Result: A single representation of the place that captures both its name and its real-life behavior.

Why Does This Matter?

The authors tested this on five real-world tasks, like predicting if a store is permanently closed, guessing its price level, or figuring out its opening hours.

The Result: The new system beat all the old ones.
The Surprise: Even when they removed the text (the brochure) and used only the movement data, the system sometimes beat systems that only read text. This suggests that how people move carries information the sign does not.

In Summary:
This paper teaches computers to stop just reading the menu and start watching the customers. By combining the static description of a place with the dynamic flow of human movement, ME-POIs builds a representation of each place that reflects not just where it is and what it is called, but how people actually use it.

1. Problem Statement

The paper addresses a critical gap in Point-of-Interest (POI) representation learning. Existing approaches generally fall into two categories, both of which have limitations:

Static Text-Based Models: These rely on Large Language Models (LLMs) and static metadata (e.g., category, address, description) to learn POI "identity" (what a place is). However, they fail to capture dynamic "function" (how a place is actually used), cannot handle missing/outdated metadata (common for new POIs), and often conflate functionally distinct locations with similar descriptions (e.g., a busy chain coffee shop vs. a quiet local café).
Mobility-Based Models: These use trajectory data to predict the next location. While they capture movement regularities, their embeddings are context-dependent (optimized for sequence prediction) rather than POI-centric. They often fail to distinguish intrinsic differences between places that share similar movement patterns (e.g., a gym and a bar both visited after work).

Core Hypothesis: A robust, general-purpose POI representation must encode both Identity (static attributes) and Function (dynamic usage patterns derived from human mobility). The paper argues that POI function is a missing but essential signal for generalizable representations.

2. Methodology: ME-POIs Framework

The authors propose Mobility-Embedded POIs (ME-POIs), a pretraining framework that augments static text embeddings with large-scale human mobility data. The architecture (illustrated in Figure 2 of the paper) consists of five key components:

A. Visit Sequence Encoder

Input: User visit sequences $s = (v_1, \dots, v_L)$, where each visit $v_i$ includes POI coordinates ($x_i$), arrival time ($t^a_i$), and departure time ($t^d_i$).
Feature Encoding:
- Spatial: Uses a multiscale location encoder (Space2Vec) to capture local and regional context.
- Temporal: Uses Time2Vec to separately encode arrival times and durations.
Sequence Modeling: The concatenated features are fed into a Transformer Encoder with sinusoidal positional encodings to capture temporal dependencies and co-visitation patterns within a user's trajectory. This produces contextualized visit embeddings ($h_i$).
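
The temporal part of the feature encoding can be illustrated with a minimal sketch of Time2Vec, whose standard form uses one linear component followed by sinusoidal components. The function names, the embedding size of 4, the use of hours as the time unit, and the random parameters are assumptions for illustration only; in a trained model the frequencies and phases would be learned, and the spatial (Space2Vec) features and the Transformer are omitted here.

```python
import math
import random

def time2vec(tau, omegas, phis):
    """Time2Vec: first component is linear in tau, the rest are sinusoidal."""
    feats = [omegas[0] * tau + phis[0]]
    feats += [math.sin(w * tau + p) for w, p in zip(omegas[1:], phis[1:])]
    return feats

def init_params(dim, rng):
    """Hypothetical random init; real frequencies/phases would be learned."""
    omegas = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    phis = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return omegas, phis

def encode_visit(arrival_hour, departure_hour, arr_params, dur_params):
    """Encode arrival time and duration separately, then concatenate."""
    duration = departure_hour - arrival_hour
    return time2vec(arrival_hour, *arr_params) + time2vec(duration, *dur_params)

rng = random.Random(0)
arr_params = init_params(4, rng)   # parameters for arrival times
dur_params = init_params(4, rng)   # separate parameters for durations

# A visit arriving at 8:00 and leaving at 8:15 (times in hours).
visit_features = encode_visit(8.0, 8.25, arr_params, dur_params)
print(len(visit_features))  # 4 arrival features + 4 duration features
```

Using separate parameter sets for arrival time and duration mirrors the summary's statement that the two are encoded separately, so a short morning visit and a long morning visit end up with different feature vectors.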

B. Global POI Alignment (Contrastive Learning)

Objective: To learn a global, context-independent embedding ($z^{ME}_p$) for each POI $p$.
Mechanism: The model treats the visit embedding $h_i$ (from a visit to POI $p$) as a positive pair with the global prototype $z^{ME}_p$. It minimizes an InfoNCE loss in which the prototypes of other POIs in the batch serve as negatives.
Result: This forces the global prototype to act as a "functional centroid," aggregating usage patterns across diverse users and times while suppressing noise from individual schedules.
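
A minimal sketch of an InfoNCE loss for one visit embedding against a set of POI prototypes is given below. The choice of cosine similarity, the temperature value of 0.1, and the names `info_nce` and `prototypes` are assumptions; the summary only states that the visit embedding and its own POI's prototype form the positive pair and that other prototypes in the batch are contrasted against it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(visit_emb, prototypes, positive_idx, temperature=0.1):
    """-log softmax probability of the positive prototype among all prototypes."""
    logits = [cosine(visit_emb, z) / temperature for z in prototypes]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_idx]

# Toy batch: three POI prototypes and one visit embedding close to prototype 0.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h = [0.9, 0.1]

loss_correct = info_nce(h, prototypes, positive_idx=0)
loss_wrong = info_nce(h, prototypes, positive_idx=1)
print(loss_correct < loss_wrong)
```

The loss is small when the visit embedding sits near its own POI's prototype and far from the others, which is the pressure that pulls all visits to a POI toward one shared prototype.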

C. Multi-Scale Distribution Transfer (Addressing Sparsity)

Challenge: Long-tail POIs (rarely visited) suffer from data sparsity, making their embeddings unreliable.
Solution: A mechanism to propagate temporal visit patterns from frequent "Anchor POIs" to sparse POIs.
- Anchors: Top-$k$ POIs with high visit counts.
- Transfer: Empirical visit distributions (e.g., hourly activity over a week) are aggregated from anchors to sparse POIs using a multi-scale Gaussian kernel (capturing local neighborhood and broader district trends).
- Loss: An auxiliary KL divergence loss ($\mathcal{L}_{\text{KL-sparse}}$) forces the sparse POI embedding to predict this transferred distribution.
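
The transfer and loss steps above can be sketched as follows, under several assumptions: coordinates are in kilometres in a local planar frame, the multi-scale kernel is an equal-weight average of Gaussians with bandwidths of 0.5 km and 5 km, and distributions have three time bins rather than a full hourly week. The exact kernel form and bandwidths in the paper are not given in this summary, and all names here are hypothetical.

```python
import math

def gaussian_weight(dist_km, sigma_km):
    """Unnormalized Gaussian kernel weight at a given distance."""
    return math.exp(-dist_km ** 2 / (2 * sigma_km ** 2))

def transfer_distribution(sparse_xy, anchors, sigmas_km=(0.5, 5.0)):
    """Kernel-weighted average of anchor visit distributions.

    anchors: list of ((x_km, y_km), distribution) pairs.
    sigmas_km: one bandwidth per spatial scale (assumed values).
    """
    n_bins = len(anchors[0][1])
    mixed = [0.0] * n_bins
    total = 0.0
    for (ax, ay), dist in anchors:
        d = math.hypot(sparse_xy[0] - ax, sparse_xy[1] - ay)
        w = sum(gaussian_weight(d, s) for s in sigmas_km) / len(sigmas_km)
        total += w
        for i, p in enumerate(dist):
            mixed[i] += w * p
    return [m / total for m in mixed]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

# Bins: morning, afternoon, evening. One anchor nearby, one across the district.
anchors = [((0.1, 0.0), [0.7, 0.2, 0.1]),   # busy bakery next door
           ((4.0, 0.0), [0.1, 0.2, 0.7])]   # busy bar 4 km away
target = transfer_distribution((0.0, 0.0), anchors)

uniform_prediction = [1 / 3, 1 / 3, 1 / 3]
loss = kl_divergence(target, uniform_prediction)
print(target, loss)
```

The narrow bandwidth lets the nearby anchor dominate, while the wide bandwidth still admits some influence from the distant one, so the transferred target leans toward the morning pattern without ignoring the broader district.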

D. Direct Supervision for Anchors

Anchor POIs are directly supervised to ensure their global prototypes faithfully encode empirically observed temporal usage patterns via a similar KL divergence loss ($\mathcal{L}_{\text{KL-anchor}}$).

E. Text Alignment

To integrate static semantics, the mobility embeddings ($z^{ME}_p$) are aligned with text embeddings ($z^{text}_p$) derived from LLMs (using GeoLLM-style prompts).
A linear projection aligns the text embedding space with the mobility space, maximizing cosine similarity ($\mathcal{L}_{\text{text-align}}$). This ensures the final representation captures both semantic identity and dynamic function.
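
As a rough illustration of the alignment step, the sketch below projects a text embedding into the mobility space with a matrix and scores it with one minus cosine similarity. Writing the loss as `1 - cos` is an assumption (the summary only says cosine similarity is maximized), and the matrix `W`, the dimensions, and the function names are hypothetical; in training, `W` would be learned.

```python
import math

def project(W, x):
    """Linear projection: multiply matrix W (rows x len(x)) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def text_align_loss(W, z_text, z_me):
    """Assumed loss form: 1 - cosine(projected text, mobility embedding)."""
    return 1.0 - cosine(project(W, z_text), z_me)

# Toy example: 3-dimensional text embedding, 2-dimensional mobility embedding.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
z_text = [2.0, 0.0, 5.0]

aligned = text_align_loss(W, z_text, [1.0, 0.0])     # same direction after projection
misaligned = text_align_loss(W, z_text, [0.0, 1.0])  # orthogonal after projection
print(aligned, misaligned)
```

Because cosine similarity ignores vector length, this objective only asks the two embeddings to point the same way, which leaves the scale of the mobility embedding free to be shaped by the other loss terms.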

Total Loss Function:
$\mathcal{L} = \mathcal{L}_{\text{ME-POI}} + \lambda_a \mathcal{L}_{\text{KL-anchor}} + \lambda_s \mathcal{L}_{\text{KL-sparse}} + \lambda_t \mathcal{L}_{\text{text-align}}$

3. Key Contributions

ME-POIs Framework: A novel architecture that fuses static text embeddings with mobility-derived signals to learn POI-centric representations capturing both identity and function.
New Learning Objective: A contrastive learning paradigm that aligns visit-level embeddings with global POI prototypes, moving away from trajectory-prediction objectives toward POI-function objectives.
Sparsity Solution: A multi-scale distribution transfer mechanism that effectively handles long-tail POIs by leveraging temporal patterns from nearby, data-rich locations.
Comprehensive Evaluation: Introduction of five new map enrichment tasks to rigorously test POI representations.

4. Experimental Results

The model was evaluated on two large-scale datasets (Los Angeles and Houston) across five downstream tasks:

Weekly Opening Hours Prediction (Temporal pattern)
Permanent Closure Detection (Static status)
Visit Intent Classification (User interest)
Busyness Estimation (Foot traffic)
Price Level Classification (Socioeconomic attribute)

Key Findings:

Augmentation: Adding ME-POIs to strong text-only baselines (e.g., OpenAI, Gemini, MPNET) consistently improved performance across all tasks.
- Example: Up to 81.9% improvement in F1 for Visit Intent and 75.1% for Price Level classification.
- Example: 24.7% reduction in MAE for Busyness estimation.
Mobility vs. Text: ME-POIs trained without text alignment (mobility-only) outperformed many text-only models on specific tasks (e.g., surpassing Gemini on price level classification), indicating that mobility data carries functional signals often missing from text.
Mobility vs. Mobility: ME-POIs significantly outperformed the state-of-the-art mobility baselines it was compared against (e.g., POI2Vec, TrajGPT, CTLE). This supports the view that optimizing for POI-centric function suits these tasks better than optimizing for trajectory prediction.
Sparsity Handling: Ablation studies showed that the distribution transfer mechanism significantly improved performance for sparse POIs, narrowing the performance gap between anchor and sparse locations.

5. Significance and Impact

Paradigm Shift: The paper demonstrates that for geospatial applications, understanding how a place is used (function) is as critical as knowing what it is (identity).
Practical Application: The framework directly addresses real-world challenges in automated map maintenance (e.g., detecting closed businesses, updating opening hours) and location recommendation, where static metadata is often outdated or incomplete.
Generalizability: The approach is not limited to POIs; the authors suggest it can be extended to other geospatial objects like road segments and administrative boundaries, highlighting the potential of mobility-informed foundation models for the broader field of GeoAI.

In conclusion, ME-POIs establishes that integrating dynamic human mobility data with static semantic models creates a more robust, accurate, and generalizable representation of the physical world than either modality can achieve alone.