Quantifying and extending the coverage of spatial categorization data sets

This paper demonstrates that large language models can closely match human spatial categorization labels, and uses them to guide the strategic expansion of the Topological Relations Picture Series (TRPS), producing a new 42-scene set that covers spatial relations better than previous extensions.

Wanchun Li, Alexandra Carstensen, Yang Xu, Terry Regier, Charles Kemp

Published Wed, 11 Ma

Imagine you are trying to map out the entire world of "where things are." You want to know how people in different countries describe the position of a cup on a table, a bird in a cage, or a shadow on a wall.

For decades, researchers have used a specific "photo album" called the TRPS (Topological Relations Picture Series) to do this. It has 71 pictures showing objects in various positions. But here's the problem: the album is incomplete. It's like trying to map the entire ocean using only a few pictures of the shoreline. It misses vast areas of the "ocean" of spatial language, especially for languages other than English.

This paper is about how the authors used Artificial Intelligence (specifically Large Language Models or LLMs) to fix this map, fill in the missing pieces, and figure out which new pictures and languages are most important to add next.

Here is the breakdown of their approach using simple analogies:

1. The Problem: The "Missing Puzzle Pieces"

Think of the existing 71 pictures as a puzzle that only shows the edges. Researchers know there are many more ways to describe space (like "among," "under," "left of," or "outside"), but they don't have pictures for all of them.

  • The Challenge: To make a complete map, they need to add hundreds of new pictures and test them in dozens of languages. Doing this by hand (hiring humans to draw pictures and label them) is too slow and expensive.

2. The Solution: The "AI Intern"

The authors decided to use an AI (specifically a model called Gemini) as a super-fast research assistant.

  • The Experiment: They showed the AI 220 different pictures (the original 71 plus roughly 150 new ones) and asked it: "If you were a native speaker of Spanish, Chinese, or French, what word would you use to describe this picture?"
  • The Test: They checked if the AI's answers matched what real humans would say.
  • The Result: The AI was surprisingly good! It agreed with human answers about 80–90% of the time. It's not perfect, but it's accurate enough to be a reliable "draftsman."
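That agreement check boils down to a simple comparison: for each picture, does the AI's label match the human one? Here is a minimal sketch, assuming exact-match scoring; the picture IDs and labels below are invented for illustration, and the paper's actual evaluation may use a more nuanced metric.

```python
def agreement(llm_labels, human_labels):
    """Fraction of pictures where the LLM's label matches the human label."""
    matches = sum(
        1 for pic, label in llm_labels.items()
        if human_labels.get(pic) == label
    )
    return matches / len(llm_labels)

# Hypothetical labels for three scenes.
llm_labels = {"cup_on_table": "on", "bird_in_cage": "in", "ring_on_finger": "on"}
human_labels = {"cup_on_table": "on", "bird_in_cage": "in", "ring_on_finger": "around"}

print(agreement(llm_labels, human_labels))  # 2 of 3 labels match
```

At scale, running this over hundreds of pictures and many languages is what lets the authors quantify how trustworthy the "AI intern" is before relying on it.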

3. The Strategy: The "Coverage Map"

Now that they had an AI that could guess labels for any language and any picture, they needed a way to decide which new pictures to actually test with real humans. They didn't want to just add random pictures; they wanted to fill the "gaps" in the map.

They used a concept called Coverage, which is like a net catching fish:

  • Imagine the "universe of all possible spatial scenes" is a huge ocean full of different fish (scenes).
  • The old 71 pictures are a small net that only catches the fish near the shore.
  • The goal is to cast a bigger net that catches fish from the deep ocean, the reefs, and the open sea.
  • The AI helped them simulate casting nets in different places. They asked: "If we add this new picture, does it catch a type of 'fish' (spatial concept) that our current net is missing?"
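The "casting nets" idea can be read as a greedy coverage procedure: repeatedly pick the candidate scene that catches the most concepts your current set is still missing. The sketch below is one plausible implementation under that assumption; the scenes, the concept sets, and the greedy strategy itself are illustrative, not the paper's exact method.

```python
def pick_scenes(candidates, budget):
    """Greedily pick scenes that cover the most not-yet-covered concepts."""
    covered, chosen = set(), []
    for _ in range(budget):
        # Score each remaining candidate by how many new concepts it adds.
        best = max(candidates, key=lambda s: len(candidates[s] - covered))
        if not candidates[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered

# Hypothetical candidate scenes, each "catching" some spatial concepts.
candidates = {
    "apple_in_bowl":    {"in"},
    "cat_under_table":  {"under", "near"},
    "keys_among_coins": {"among", "near"},
    "dot_left_of_line": {"left-of"},
}
chosen, covered = pick_scenes(candidates, budget=2)
```

With a budget of two, the greedy pass first takes the scene covering two new concepts, then whichever remaining scene adds one more, so a small budget still buys broad coverage.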

4. The Outcome: A Better Map

Using this AI-assisted strategy, they created a new set of 42 pictures (called the LCXRK set).

  • The Result: When they measured how much of the "ocean" these new pictures covered, they found that their new set covered the space much better than previous attempts.
  • The Language Test: They also used the AI to figure out which languages were missing from their study. The AI suggested that Portuguese and Romanian were very different from the languages they already had, so those should be the next ones to test with real humans.
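One way to measure "very different" between languages is to compare how each one groups the same scenes: two languages disagree when one uses the same word for a pair of scenes while the other splits them apart. The sketch below shows that idea; the languages, words, and scenes are invented, and the paper's actual dissimilarity measure may differ.

```python
from itertools import combinations

def partition_distance(lang_a, lang_b):
    """Fraction of scene pairs that one language groups under the same term
    while the other assigns different terms."""
    pairs = list(combinations(sorted(lang_a), 2))
    disagreements = sum(
        1 for s1, s2 in pairs
        if (lang_a[s1] == lang_a[s2]) != (lang_b[s1] == lang_b[s2])
    )
    return disagreements / len(pairs)

# Hypothetical labelings: Spanish "en" lumps scenes that English splits.
english = {"cup_on_table": "on", "ring_on_finger": "on", "apple_in_bowl": "in"}
spanish = {"cup_on_table": "en", "ring_on_finger": "en", "apple_in_bowl": "en"}

print(partition_distance(english, spanish))
```

A language scoring high on this distance against everything already in the dataset is exactly the kind the AI would flag as the most informative one to test next with real speakers.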

5. Why This Matters

This paper isn't saying "AI will replace human scientists." Instead, it's saying "AI is the best tool to help us plan our experiments."

  • Before: Researchers had to guess which pictures to draw and which languages to study.
  • Now: They can use AI to simulate thousands of scenarios, find the gaps in their knowledge, and then use real humans to verify the most important ones.

In a nutshell: The authors used AI to build a better "menu" of spatial descriptions. They used the AI to taste-test thousands of combinations, figured out which dishes (scenes and languages) were missing from the menu, and created a new, more complete menu that covers the whole world of how we describe "where things are."