National and state-level datasets of United States forensic DNA databases 2001-2025

This paper presents a comprehensive, harmonized series of national and state-level datasets (2001–2025) detailing the structure, composition, and policy context of United States forensic DNA databases to facilitate longitudinal and cross-sectional analyses of their historical development and growth.

Yemko Pryor, Virum Ranka, Joao Pedro Donadio, Samantha C. Muller, Jenna Wilson, Tina Lasisi

Published 2026-03-05

Imagine the United States has a massive, digital library. But instead of books, this library stores DNA profiles—genetic blueprints collected from crime scenes, convicted criminals, and, in many places, people who have been arrested but not yet convicted.

For the last 25 years, this library has been growing wildly. However, trying to study it has been like trying to read a book where the pages are scattered, some are written in different languages, and the librarian keeps changing the filing system every few years.

This paper is essentially a master key and a new map for that library. The researchers have spent years cleaning up the mess, organizing the scattered pages, and creating three new, easy-to-use datasets so anyone can finally understand how this system works, how big it is, and who is in it.

Here is a breakdown of what they did, using simple analogies:

1. The Problem: The "Shifting Sand" Library

Think of the FBI's national DNA database (called NDIS) as a giant scoreboard that updates every month.

  • The Issue: In the past, if you wanted to know how many profiles were in the database in 2010, you'd have to find a specific webpage from that time. But the FBI kept redesigning their website. Sometimes the data was on one page, sometimes split across fifty different pages. If nobody had saved a copy of the page at the time, that data was effectively gone forever.
  • The State Level: It was even worse at the state level. Some states published their numbers; others didn't. Some listed the types of people in the database (by race or gender); most didn't. It was like trying to count the apples in 50 different orchards, but only 10 orchards had signs, and the signs were written in different fonts.

2. The Solution: The "Time-Traveling Librarian"

The researchers acted like digital archaeologists. They didn't just look at the current website; they went back in time using the Internet Archive (Wayback Machine).
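To make the "time travel" concrete: the Internet Archive exposes a CDX API that lists every capture of a page. A minimal sketch of building such a query is below; the NDIS page URL and date range are illustrative stand-ins, not necessarily the exact ones the researchers used.

```python
from urllib.parse import urlencode

# The Internet Archive's CDX API endpoint (a real, documented service).
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def snapshot_query(page_url: str, start: str, end: str) -> str:
    """Build a CDX query URL listing every capture of `page_url`
    between `start` and `end` (yyyymm), one capture per line."""
    params = {
        "url": page_url,
        "from": start,
        "to": end,
        "output": "json",
        "filter": "statuscode:200",  # skip captures of error pages
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Illustrative target: an FBI NDIS statistics page path (assumed).
query = snapshot_query(
    "fbi.gov/services/laboratory/biometric-analysis/codis/ndis-statistics",
    "200107", "202508",
)
print(query)
```

Fetching that URL returns one row per archived capture, which is how a corpus of thousands of snapshots can be enumerated before downloading.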

  • Dataset 1: The National Time Machine (NDIS)
    They downloaded 11,359 snapshots of the FBI's website from July 2001 to August 2025. Imagine taking a photo of a scoreboard every single day for 24 years. They then wrote computer programs (parsers) to read these photos, even though the scoreboard looked different in 2005 than it did in 2020.

    • The Result: A continuous timeline showing exactly how many profiles were added every month, broken down by state and profile type (convicted offender, arrestee, or forensic crime-scene).
  • Dataset 2: The State Policy Map (SDIS)
    They went on a "treasure hunt" across all 50 states to find current rules.

    • The Result: A cheat sheet that answers questions like: "Does this state collect DNA from people who have merely been arrested?" and "Can investigators search the database for close relatives of a suspect?" (a practice known as familial searching). It maps out the rules of the game for every single state.
  • Dataset 3: The Demographic Decoder (FOIA)
    This was the hardest part. For years, no one knew the racial or gender makeup of the database because states rarely reported it.

    • The Result: The researchers located responses from 2018 in which seven states answered a Freedom of Information Act (FOIA) request. They digitized these handwritten or scanned reports into a clean, machine-readable format. It is the first time this specific demographic data has been organized for easy study.
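Because the scoreboard's layout changed over the years, the parsers for Dataset 1 have to recognize several page formats. The sketch below shows the idea with two invented layouts and made-up numbers; the real NDIS markup and the paper's actual parsers are certainly more involved.

```python
import re

# Hypothetical examples of how a statistics page might have looked
# in different eras (both the markup and the counts are invented).
SNAPSHOT_2005 = "<td>Offender Profiles</td><td>2,132,470</td>"
SNAPSHOT_2020 = '<span class="stat">Offender Profiles: 14,137,551</span>'

def parse_count(html: str, label: str):
    """Extract the profile count that follows `label`, tolerating
    either a table-cell layout or an inline 'Label: N' layout."""
    patterns = [
        rf"<td>{label}</td><td>([\d,]+)</td>",  # table-era layout
        rf"{label}:\s*([\d,]+)",                # inline-era layout
    ]
    for pat in patterns:
        m = re.search(pat, html)
        if m:
            return int(m.group(1).replace(",", ""))
    return None  # unrecognized layout: flag the snapshot for review

print(parse_count(SNAPSHOT_2005, "Offender Profiles"))  # 2132470
print(parse_count(SNAPSHOT_2020, "Offender Profiles"))  # 14137551
```

Returning `None` rather than raising lets a pipeline record which snapshots defeated every known parser, so new layouts can be added as they are discovered.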

3. Cleaning the Data: The "Spot the Fake" Game

When you scrape data from the internet over 24 years, you get glitches.

  • The Glitch: Sometimes the FBI website would show "10,000" profiles, and the next month show "100,000" (a mistake), then go back to "10,000." Or, the website might get stuck and show the same number for three months in a row because it was "cached" (stuck in memory).
  • The Fix: The researchers built a "truth detector." They created a set of rules to flag these weird jumps.
    • Analogy: Imagine a teacher grading a test. If a student's score jumps from 50 to 500 in one second, the teacher flags it as a "decimal error." If the score stays exactly the same for a year when it should be changing, the teacher flags it as "stale data." They didn't delete the weird numbers; they just put a "warning sticker" on them so researchers know to be careful.
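The flagging logic described above can be sketched as a pure function over a monthly series. The thresholds and exact rules here are illustrative assumptions, not the paper's actual validation criteria; the key design point, flagging rather than deleting, is preserved.

```python
def flag_anomalies(series, spike_ratio=5.0, stale_run=3):
    """Return one set of flags per data point, without deleting anything.

    - 'spike': the value jumps by >= `spike_ratio`x over its predecessor
      and the next value reverts toward the pre-jump level.
    - 'stale': the same value repeats for `stale_run`+ consecutive
      months (a likely cached/stuck page).
    """
    flags = [set() for _ in series]

    # Spike detection: large jump that the next point walks back.
    for i in range(1, len(series) - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if prev > 0 and cur / prev >= spike_ratio \
                and abs(nxt - prev) < abs(nxt - cur):
            flags[i].add("spike")

    # Stale detection: runs of identical values.
    run_start = 0
    for i in range(1, len(series) + 1):
        if i == len(series) or series[i] != series[run_start]:
            if i - run_start >= stale_run:
                for j in range(run_start, i):
                    flags[j].add("stale")
            run_start = i
    return flags

# A toy series with one decimal-style glitch and one cached run.
counts = [10_000, 10_200, 100_000, 10_400, 10_400, 10_400, 10_400]
flags = flag_anomalies(counts)
```

Here index 2 gets a "spike" flag (the 100,000 reading that reverts) and the trailing run of identical 10,400s gets "stale" flags, while the values themselves remain untouched.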

4. Why Does This Matter?

Before this paper, studying the US DNA database was like trying to drive a car with a cracked windshield and a missing map. You could guess where you were going, but you couldn't be sure.

Now, researchers, journalists, and policymakers have:

  1. A Clear History: They can see exactly how the database grew over time.
  2. A Policy Guide: They can compare how different states treat arrestees or family searches.
  3. Demographic Insight: They can start to analyze whether the database disproportionately affects certain racial or gender groups (based on the limited data available).

In a nutshell: This paper is the "User Manual" for the US Forensic DNA Database. It takes a chaotic, 25-year history of scattered web pages and turns it into a clean, organized, and trustworthy toolkit for understanding one of the most powerful tools in modern criminal justice.