LAND: A Longitudinal Analysis of Neuromorphic Datasets

Imagine the field of Neuromorphic Engineering (building sensors and computer chips that work more like biological eyes and brains) as a massive, bustling kitchen where chefs are trying to invent new recipes for artificial intelligence.

For the last decade, this kitchen has been exploding with activity. Chefs are publishing thousands of new "recipes" (algorithms) and bringing in huge piles of ingredients (datasets) to test them. But, as this paper by Gregory Cohen and Alexandre Marcireau points out, the kitchen is in a bit of a mess.

Here is the story of the paper, broken down into simple concepts and analogies.

1. The "Data Hoarding" Problem

The Analogy: Imagine a library where people keep building new, tiny, private libraries instead of using the big public one.
The Reality: Even though the paper catalogues 423 different datasets (collections of data), researchers keep saying they need more data. Instead of reusing those existing datasets, they often go out and collect brand-new data for every single experiment.

The Issue: It's like 100 people trying to bake a cake, but instead of sharing the flour in the pantry, everyone buys their own bag of flour, bakes one cake, and throws the rest away. It's wasteful and slows down progress.
The "Popularity Contest": The paper found that most researchers only use a tiny handful of the famous, popular datasets (like the "bestsellers" in a bookstore). They ignore the hundreds of other useful datasets sitting on the shelves.

2. The "Link Rot" Disaster

The Analogy: Imagine a treasure map where the "X" marks the spot, but the path to the treasure is a link to a friend's personal Google Drive. If that friend moves houses, changes their email, or quits their job, the map leads to nowhere.
The Reality: A huge chunk of these datasets are hosted on personal links (like Google Drive, Dropbox, or a professor's university page).

The Problem: If the person who uploaded the data leaves the university or loses their account, the data can vanish for good. The paper found that personal shares make up about 42% of distribution methods, which makes those datasets "unreliable tenants" that might move out at any time.
The Solution: We need "Sustainable Shares"—like a public library or a museum archive (e.g., Zenodo)—where data is stored forever, regardless of who created it.

3. The "Language Barrier"

The Analogy: Imagine trying to cook a recipe written in a language you don't speak, using ingredients measured in units you've never seen, inside a pot that requires a special key to open.
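The Reality: Neuromorphic data comes in a chaotic mix of file formats. Some are in "binary" (raw bytes that neither a person nor a program can read without knowing the exact layout), some are in "CSV" (plain-text tables that open in a spreadsheet program like Excel), and some are in "ROSBag" (a robot-specific format).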

The Problem: There is no standard "English" for this data. One researcher might save a file where the time comes first, and another saves it where the location comes first. To use the data, you often need to download a massive, compressed file just to see what's inside, and then write special code just to open it.
The Solution: We need to agree on a few standard, easy-to-read formats (like Numpy or HDF5) so anyone can open the data without needing a PhD in file conversion.

4. The "Fake Ingredients" (Simulated Data)

The Analogy: Imagine a chef who has never tasted a real strawberry, so they make a "strawberry" out of red dye and sugar. It looks like a strawberry and tastes sweet, but it doesn't have the texture or the subtle flavor of the real fruit.
The Reality: Because collecting real data is hard and expensive, many researchers are using Simulated Data. They take regular video (like a YouTube clip) and use software to pretend it was recorded by a neuromorphic camera.

The Benefit: It's cheap and easy. You can simulate a car crash or a trip to the moon without ever leaving your desk.
The Danger: Simulated data is "too clean." Real neuromorphic sensors have noise, glitches, and weird behaviors. If you train your AI on "fake" data, it might work perfectly in the simulation but fail miserably when you put it in a real robot. The paper warns: Use simulation to test what you know, not to discover what you don't.

5. The "Blind Spot" (Lack of Context)

The Analogy: Imagine looking at a photo of a forest. You can see the trees, the sky, and the path. Now, imagine looking at a photo of the same forest, but it's been edited so that only the moving leaves are visible, and everything else is black. Without a caption, you have no idea if you are looking at a forest, a city, or a kitchen.
The Reality: Neuromorphic cameras only record changes in brightness (usually caused by movement), not static images. If you look at the raw data, it often looks like random static or noise.

The Problem: Unlike a normal photo, you can't just "look" at neuromorphic data and understand what's happening. The paper argues that datasets are missing context. They don't explain where the camera was, what the lighting was, or why the camera was moving.
The Fix: Researchers must write detailed "storybooks" (metadata) for their data so others know what they are looking at.

The Big Takeaway: "Reduce, Reuse, Recycle"

The paper concludes with a plea to the community to change their habits:

Don't reinvent the wheel: If a dataset already exists, use it. Don't collect new data unless you absolutely have to.
Store it safely: Put your data in a public, permanent archive, not behind a personal cloud-storage link.
Speak clearly: Use standard file formats and write clear instructions so anyone can use your data.
Be honest about fakes: If you use simulated data, admit it and explain its limits.

The "LAND" Tool:
Finally, the authors created a tool called LAND (List of Available Neuromorphic Datasets). Think of this as a Google Maps for data. Instead of wandering around lost in the woods, researchers can now use LAND to find exactly what they need, see if it's reliable, and download it without getting stuck in a dead-end alley.

In short: The field is growing fast, but it's messy. To build the future of brain-like computers, we need to stop hoarding ingredients, start sharing recipes, and make sure our "fake" strawberries don't trick us into thinking they are real.

1. Problem Statement

The neuromorphic engineering community faces a critical data problem. Despite a meteoric rise in the number of published neuromorphic datasets over the last decade, research continues to demand "more data" and larger datasets. This paradox is driven by:

Data-Driven Shift: Modern deep learning approaches require massive volumes of data, yet the community struggles to find, understand, and utilize existing datasets.
Accessibility Barriers: Significant practical difficulties exist in downloading, accessing, and parsing available data due to non-standardized formats, broken links, and restrictive licensing.
Lack of Context: Neuromorphic event-based data lacks the inherent visual context found in frame-based images, making it difficult to determine the nature of a task or scene without extensive metadata.
Reusability Issues: Researchers tend to create new datasets rather than reuse existing ones, and citation patterns are highly unequal: only a few datasets are widely used while the majority are ignored.
Simulation Pitfalls: The rise of synthetic data (simulated or video-to-event) introduces potential biases and inaccuracies when exploring novel applications, as simulators often fail to capture the unique physical characteristics of real neuromorphic sensors.

2. Methodology

The authors conducted a longitudinal analysis of the neuromorphic landscape, examining 423 datasets derived from 386 academic publications, totaling over 41 TB of data.

Data Collection: The authors compiled a comprehensive catalogue (the "LAND" tool) covering datasets from 2015 to 2025.
Reusability Analysis: They used academic citations as a proxy for dataset usage. They analyzed the ratio of new datasets to citing papers and examined the distribution of citations using the Gini coefficient (an economic metric for inequality) to measure how evenly citations are distributed across the dataset population.
Distribution & Accessibility Audit: The study categorized distribution methods (e.g., sustainable repositories vs. personal links) and analyzed file formats (e.g., HDF5, ROSBag, Numpy, binary) to assess technical accessibility.
Content Analysis: The authors evaluated the nature of the data, distinguishing between Real Data (captured by physical sensors), Quasi-Real (monitor conversions), and Simulated Data (video-to-event or ray-tracing). They also investigated metadata quality, specifically the lack of implied context in event streams.
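
To make the inequality metric concrete, here is a minimal sketch of one standard way to compute a Gini coefficient from a list of per-dataset citation counts. It is not the authors' code; the function name `gini` and the citation counts are hypothetical and invented purely for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means every dataset is cited equally; values approaching 1 mean
    a few datasets collect almost all citations.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with ranks i = 1..n over the values sorted in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical citation counts (not figures from the paper).
equal_counts = [10, 10, 10, 10]   # every dataset cited equally
skewed_counts = [0, 0, 0, 40]     # one dataset takes every citation

print(gini(equal_counts))   # 0.0
print(gini(skewed_counts))  # 0.75, the maximum possible for four items
```

With only four items the coefficient cannot exceed 0.75, which is a reminder that Gini values are easiest to compare across populations of similar size.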

3. Key Contributions

The LAND Tool: The primary contribution is the List of Available Neuromorphic Datasets (LAND), an interactive, living catalogue that allows researchers to locate, filter, and understand existing datasets before collecting new ones.
Comprehensive Taxonomy: The paper establishes a rigorous definition of what constitutes a "neuromorphic dataset" and categorizes them by source (real, simulated, monitor conversion), distribution method, and file format.
Critical Metrics: Introduction of the Gini coefficient to neuromorphic research to quantify the inequality in dataset citation, revealing that citation distribution is highly skewed.
Standardization Guidelines: A set of actionable recommendations for the community regarding dataset creation, distribution, and documentation to improve reusability and reproducibility.

4. Key Results & Findings

A. Reusability and Citation Inequality

Growth vs. Usage: While the number of new datasets has grown exponentially (especially post-2021), the median number of datasets cited per paper is approximately 1. In other words, the typical paper relies on a single dataset.
The "Review" Effect: The high average citation count (approx. 14 citations/dataset) is driven almost entirely by a small number of survey and review papers.
Inequality: The Gini coefficient for dataset citations has risen to 0.65 (in 2025), indicating extreme inequality. A small subset of "famous" datasets (e.g., DVS-Gesture, MVSEC) receives the vast majority of citations, while most other datasets are effectively ignored.
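
The gap between a high average and a median of about 1 is what a heavily skewed distribution looks like. The sketch below shows the effect with made-up numbers: nine ordinary papers and one survey. The counts are hypothetical and do not come from the paper.

```python
from statistics import mean, median

# Hypothetical counts of datasets cited per paper (invented for illustration):
# nine ordinary papers cite one dataset each, a single survey cites 91.
datasets_cited_per_paper = [1] * 9 + [91]

average = mean(datasets_cited_per_paper)    # pulled up by the one survey
typical = median(datasets_cited_per_paper)  # reflects the ordinary paper

print(average)  # 10
print(typical)  # 1.0
```

A single outlier moves the mean by an order of magnitude while leaving the median untouched, which is why the two statistics should be reported together.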

B. Data Availability and Distribution

Fragile Hosting: There is a worrying trend toward Personal Shares (Google Drive, OneDrive), which account for 42% of distribution methods. These links are tied to individuals and often break when researchers leave institutions, threatening long-term reproducibility.
Sustainable Shift: There is a positive rise in Sustainable Shares (Zenodo, HuggingFace, IEEE DataPort), which offer versioning and DOIs, but they are not yet the dominant method.
Access Barriers: Many datasets require complex registration, phone verification, or manual forms, preventing automated access and hindering large-scale benchmarking.
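
A rough version of this kind of hosting audit can be sketched by classifying each dataset URL by its host name. The host lists below are assumptions built from the services named above, the function name `classify_host` is hypothetical, and the URLs are placeholders; this is not the paper's categorization procedure.

```python
from urllib.parse import urlparse

# Assumed host lists, derived from the services named in the text. Extend as needed.
PERSONAL_HOSTS = {"drive.google.com", "onedrive.live.com", "1drv.ms", "dropbox.com"}
SUSTAINABLE_HOSTS = {"zenodo.org", "huggingface.co", "ieee-dataport.org"}


def classify_host(url):
    """Return 'personal', 'sustainable', or 'unknown' for a dataset URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    def matches(known_hosts):
        # Accept the listed host itself or any subdomain of it.
        return any(host == h or host.endswith("." + h) for h in known_hosts)

    if matches(PERSONAL_HOSTS):
        return "personal"
    if matches(SUSTAINABLE_HOSTS):
        return "sustainable"
    # Personal university pages land here: a host name alone cannot identify them.
    return "unknown"


# Placeholder URLs, not real dataset links.
print(classify_host("https://drive.google.com/file/d/EXAMPLE/view"))  # personal
print(classify_host("https://zenodo.org/records/0000000"))            # sustainable
print(classify_host("https://example.edu/~someone/events.zip"))       # unknown
```

Host matching is only a first pass. A link on an institutional domain can be just as fragile as a cloud share, so the "unknown" bucket still needs manual review.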

C. Data Accessibility and Formats

Format Fragmentation: There is no single standard. Formats include ROSBag, HDF5, Numpy, aedat, and various binary formats.
Trend: The community is shifting from proprietary binary formats toward Numpy and ROSBag due to Python integration and multi-modal support. However, this creates issues with data type consistency (e.g., integer vs. float coordinates) and lack of self-describing metadata.
Time and Space Ambiguity: Timestamps often lack a standard reference (relative vs. absolute $t=0$), and spatial coordinates are not always normalized, leading to errors when combining datasets.
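
The practical cost of these inconsistencies shows up as soon as two recordings have to be combined. As a minimal sketch, assume one source stores events as (t, x, y, p) with absolute integer microseconds and another as (x, y, p, t) with float coordinates, signed polarity, and relative seconds. The canonical `Event` layout and the `normalize` helper below are hypothetical choices, not a format proposed in the paper.

```python
from typing import NamedTuple


class Event(NamedTuple):
    t_us: int      # microseconds since the first event in the recording
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # 1 = brightness increase, 0 = brightness decrease


def normalize(rows, order, time_scale_to_us=1):
    """Convert raw rows with a declared column order into canonical Events.

    `order` names each column, e.g. ("x", "y", "p", "t").
    `time_scale_to_us` converts the source time unit to microseconds
    (use 1_000_000 when the source stores seconds).
    """
    idx = {name: i for i, name in enumerate(order)}
    parsed = [
        (
            round(r[idx["t"]] * time_scale_to_us),
            round(r[idx["x"]]),             # float coordinates become integers
            round(r[idx["y"]]),
            1 if r[idx["p"]] > 0 else 0,    # accepts both 0/1 and -1/+1 polarity
        )
        for r in rows
    ]
    if not parsed:
        return []
    t0 = min(t for t, _, _, _ in parsed)    # rebase so the recording starts at t = 0
    return sorted(Event(t - t0, x, y, p) for t, x, y, p in parsed)


# Two made-up recordings of the same two events in different layouts.
source_a = [(1700000000000010, 5, 7, 1), (1700000000000000, 3, 4, 0)]
source_b = [(3.0, 4.0, -1, 0.5), (5.0, 7.0, 1, 0.50001)]

events_a = normalize(source_a, order=("t", "x", "y", "p"))
events_b = normalize(source_b, order=("x", "y", "p", "t"), time_scale_to_us=1_000_000)

print(events_a == events_b)  # True
```

The column order and the time unit have to be supplied by hand here, which is exactly the information a self-describing format would carry with the data.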

D. The Rise of Simulated Data

Volume: Simulated data (video-to-event and ray-tracing) has grown significantly, peaking around 2023-2024, driven by the need for large training sets for deep learning.
Pitfalls: Simulated data often fails to replicate the specific noise profiles, dynamic range, and asynchronous nature of real sensors. Using simulated data to explore novel applications is risky, as the simulation is limited by current understanding of the problem.
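
To see why simulated streams can be "too clean", consider a deliberately idealized video-to-event conversion: a pixel emits an event each time its log-intensity moves by a fixed threshold. This is a simplified textbook-style model, not any specific simulator discussed in the paper; the function name `frames_to_events`, the threshold value, and the tiny frames are all assumptions.

```python
import math


def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Idealized frame-to-event conversion with no noise model.

    frames: list of 2D lists of intensities in [0, 1].
    Returns a list of (frame_index, x, y, polarity) tuples, polarity in {+1, -1}.
    """
    # Per-pixel reference log-intensity, initialised from the first frame.
    ref = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for k, frame in enumerate(frames[1:], start=1):
        for y, row in enumerate(frame):
            for x, v in enumerate(row):
                diff = math.log(v + eps) - ref[y][x]
                # Emit one event per threshold crossing, moving the reference each time.
                while abs(diff) >= threshold:
                    polarity = 1 if diff > 0 else -1
                    events.append((k, x, y, polarity))
                    ref[y][x] += polarity * threshold
                    diff = math.log(v + eps) - ref[y][x]
    return events


# Made-up 1x2 pixel "videos".
static_scene = [[[0.5, 0.5]], [[0.5, 0.5]]]
brightening = [[[0.1, 0.1]], [[0.1, 0.9]]]

print(frames_to_events(static_scene))      # [] : a static scene produces nothing at all
print(len(frames_to_events(brightening)))  # several ON events, all at pixel x=1
```

Everything a real sensor adds is missing from this model: background noise events, per-pixel threshold mismatch, latency, and sub-frame timing. A network trained only on output like this has never seen those effects.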

E. Lack of Context

The "Black Box" Problem: Unlike frame-based images, event streams do not visually convey the scene context. A static scene generates almost no events (only noise). Without detailed metadata describing the scene geometry, lighting, and camera motion, the data is often unusable for new tasks.

5. Significance and Recommendations

The paper argues that the field must shift from data creation to data curation and reuse. The authors propose the "Reduce, Re-use, Re-process" framework:

Reduce: Prioritize using existing datasets. Only create new data if existing ones are fundamentally unsuitable, and document why.
Reuse: Extend or annotate existing datasets (Meta-datasets) rather than building new ones from scratch.
Distribute Sustainably: Use persistent repositories (Zenodo, etc.) with DOIs. Avoid personal links.
Prioritize Accessibility: Use open, standard formats (Numpy, HDF5) and provide raw data alongside processed data. Avoid complex licensing barriers.
Simulate Responsibly: Use simulation for known tasks but validate against real-world data. Be transparent about the limitations of simulators.
Describe Context: Provide exhaustive metadata regarding the environment, camera motion, and task definition to compensate for the lack of inherent context in event data.
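
One lightweight way to act on the context recommendation is to ship a small machine-readable metadata file next to each recording. The sketch below is a hypothetical schema: the class name, the field names, and every value are illustrative assumptions, not a standard and not something the paper prescribes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RecordingMetadata:
    """Hypothetical sidecar schema; field names are illustrative only."""
    sensor: str
    width: int
    height: int
    data_source: str    # "real", "monitor-conversion", or "simulated"
    scene: str
    lighting: str
    camera_motion: str
    task: str
    time_unit: str      # unit of the stored timestamps
    time_origin: str    # what t = 0 refers to
    license: str


meta = RecordingMetadata(
    sensor="example-event-camera",
    width=640,
    height=480,
    data_source="real",
    scene="indoor corridor, static background, one person walking",
    lighting="overhead fluorescent, no windows",
    camera_motion="handheld, slow forward walk",
    task="pedestrian detection",
    time_unit="microseconds",
    time_origin="first event in the recording",
    license="CC-BY-4.0",
)

# Serialize to JSON so the description can travel with the event file.
sidecar_text = json.dumps(asdict(meta), indent=2)
print(sidecar_text)
```

Storing the description as plain JSON keeps it readable without any event-specific tooling, so a researcher can judge whether a recording is relevant before downloading the event data itself.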

Conclusion:
The LAND paper serves as a critical wake-up call for the neuromorphic community. It highlights that the bottleneck is no longer the generation of data, but the accessibility, standardization, and reusability of existing data. By adopting the proposed best practices and utilizing the LAND tool, the community can move toward a more robust, reproducible, and data-driven research ecosystem.