NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

The paper introduces NERdME, a new dataset of 200 manually annotated README files containing over 10,000 labeled spans across 10 entity types. It is designed to close a gap in scholarly information extraction by enabling automatic indexing of implementation-level research artifacts in code repositories.

Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack

Published 2026-03-09

Imagine the world of scientific research as a massive, bustling library. For years, librarians (and computer programs) have been very good at cataloging the books (the published scientific papers). They know how to find the title, the author, and the main topic of a book.

But there's a problem: the backpacks and toolkits that scientists use to actually do the research are often left in a messy pile in the corner. These toolkits are the code repositories (like GitHub) where the actual software, data, and instructions live.

The problem is that the instructions for these toolkits are written in README files. Think of a README file as a sticky note taped to a toolbox. It's written in free-form text (Markdown), full of jargon, links, and casual descriptions. It's messy, unstructured, and hard for a computer to read. If you ask a computer, "Where is the dataset for this project?" it might get confused because the answer is buried in a sentence like, "We used the 'COCO' dataset, which you can grab here."

Enter NERdME.

What is NERdME?

NERdME is like a super-smart training manual for computers, teaching them how to read those messy sticky notes (README files) and find the important ingredients inside.

The authors created a dataset of 200 of these README files and manually highlighted (annotated) over 10,000 specific pieces of information. They taught the computer to spot two very different types of "ingredients":

  1. The "Paper" Ingredients: Things you'd find in a formal book, like the name of a Conference, a Publication, or a Workshop.
  2. The "Code" Ingredients: Things you'd only find in the toolbox, like the Software used, the Programming Language (e.g., Python), the License (who owns the code), or the Dataset used.
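To make the "ingredient spotting" concrete, here is a minimal sketch of how a README sentence might be annotated and read back in the BIO tagging scheme commonly used for NER. The exact label names are my assumptions based on the entity types listed above, not the dataset's official tag set.

```python
# Illustrative BIO-tagged sentence. Label names are assumed from the
# entity types described in the post, not copied from NERdME itself.
sentence = ["We", "used", "the", "COCO", "dataset", "with", "Python", "."]
bio_tags = ["O", "O", "O", "B-Dataset", "O", "O", "B-ProgrammingLanguage", "O"]

def extract_spans(tokens, tags):
    """Collect (entity_type, text) pairs from BIO-tagged tokens."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                spans.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)           # entity continues
        else:
            if current:                   # entity just ended
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

print(extract_spans(sentence, bio_tags))
# → [('Dataset', 'COCO'), ('ProgrammingLanguage', 'Python')]
```

The `B-`/`I-` prefixes mark where an entity begins and continues, which is exactly the boundary information the "Boundary Challenge" below turns out to hinge on.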

Before this, existing tools could find the "Paper" ingredients or the "Code" ingredients, but not both at once. NERdME is the first resource that teaches computers to find both at the same time.

How Did They Build It?

Imagine you are trying to teach a robot to identify fruits in a chaotic fruit basket.

  • The Collection: They went to a giant online marketplace (GitHub) and picked 200 baskets that were known to have good fruit (projects linked to real scientific papers).
  • The Annotation: Three human experts examined every basket. These experts read the sticky notes and circled every mention of a "Dataset" or "Software."
  • The Consensus: If two out of three experts circled the same word, the robot learned that it was definitely a "Dataset." If they disagreed, the robot ignored it to avoid confusion. This ensured the training data was high-quality.
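The two-out-of-three consensus rule can be sketched in a few lines. The span tuples and labels below are invented for illustration; they are not taken from the actual NERdME annotation files.

```python
# Minimal sketch of majority-vote consensus across three annotators.
# Each annotator produces a list of spans: (start, end, label).
from collections import Counter

def consensus(annotations, min_agreement=2):
    """Keep a span only if at least `min_agreement` annotators assigned
    the same label to the same (start, end) character span."""
    votes = Counter()
    for spans in annotations:        # one list of spans per annotator
        for span in spans:
            votes[span] += 1
    return sorted(s for s, n in votes.items() if n >= min_agreement)

# Three annotators labeling the same README snippet (illustrative data):
a1 = [(10, 14, "Dataset"), (30, 36, "Software")]
a2 = [(10, 14, "Dataset"), (30, 36, "License")]
a3 = [(10, 14, "Dataset")]

print(consensus([a1, a2, a3]))
# → [(10, 14, 'Dataset')]  -- the Software/License disagreement is dropped
```

Dropping disputed spans trades recall for precision: the training data gets smaller, but every remaining label carries at least two independent human votes.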

What Did They Discover?

They tested the robot using two methods:

  1. The "Guessing" Robot (Zero-shot LLMs): A powerful AI that tries to guess the answers without any specific training on these files. It's like asking a smart person who has never seen a toolbox to guess what's inside just by looking at a photo. It was okay, but it missed a lot of details.
  2. The "Trained" Robot (Fine-tuned Transformers): An AI that studied the 200 annotated files specifically. It learned the patterns.
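The "guessing robot" route boils down to asking an LLM nicely. Here is a sketch of what such a zero-shot prompt could look like; the prompt wording and the entity list are illustrative assumptions (the paper's actual prompts may differ), and the call to an LLM API is deliberately left out.

```python
# Illustrative zero-shot NER prompt builder. The entity types and prompt
# template are assumptions for illustration, not the paper's exact setup.
ENTITY_TYPES = ["Dataset", "Software", "ProgrammingLanguage", "License"]

def build_prompt(readme_snippet):
    """Assemble a zero-shot extraction prompt for an LLM."""
    return (
        "Extract all named entities of the following types from the text: "
        + ", ".join(ENTITY_TYPES) + ".\n"
        "Return one line per entity in the form TYPE: text.\n\n"
        "Text:\n" + readme_snippet + "\n\nEntities:"
    )

snippet = "We used the COCO dataset; the code is written in Python (MIT license)."
print(build_prompt(snippet))
```

The fine-tuned route instead trains a token-classification model directly on the 200 annotated files, so it never has to be told what a "Dataset" looks like at inference time.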

The Results:

  • The Trained Robot was much better. It learned that "Python" usually means a programming language, while "COCO" usually means a dataset.
  • The "Long Tail" Problem: The robot was great at finding common things (like "Software" or "Datasets") because they appeared often. But it struggled with rare things (like "Workshops" or "Ontologies") because they appeared very few times. This is like a chef who is a master at making pizza but has never tried to make a soufflé because they've never seen the recipe before.
  • The Boundary Challenge: It was hard for the robot to know exactly where a name starts and ends. For example, is the name "TensorFlow" or just "Tensor"? The dataset showed that defining the exact edges of these names is tricky.
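The boundary problem has a direct effect on scoring. Under strict span matching, predicting "Tensor" where the gold annotation says "TensorFlow" earns zero credit, even though the model was nearly right. A minimal sketch of strict-span F1 (spans and labels invented for illustration):

```python
# Minimal strict-span F1: a prediction counts only if start, end, and
# label all match the gold annotation exactly. Illustrative data only.
def strict_f1(gold, pred):
    """Exact-match F1 over (start, end, label) span tuples."""
    tp = len(set(gold) & set(pred))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold       = [(0, 10, "Software")]   # "TensorFlow"
pred_exact = [(0, 10, "Software")]   # boundaries exactly right
pred_short = [(0, 6, "Software")]    # predicted only "Tensor"

print(strict_f1(gold, pred_exact))   # → 1.0
print(strict_f1(gold, pred_short))   # → 0.0
```

Relaxed (overlap-based) matching schemes give partial credit for "Tensor", which is why papers often report both strict and relaxed scores.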

Why Does This Matter? (The Downstream Test)

To prove this wasn't just a game, they ran a treasure hunt.
They took the "Dataset" names the robot found in the READMEs and tried to match them to real records in Zenodo, a global repository for research data.

  • The Result: The robot was surprisingly good at this! It could take a messy mention like "the COCO dataset" and correctly link it to the official COCO record in the database.
  • The Analogy: It's like finding a handwritten note that says "The big red apple from the orchard down the road" and successfully finding the exact apple in a massive warehouse catalog.
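The linking step amounts to fuzzy name matching: normalize the messy mention, then compare it against candidate record titles. A minimal sketch of one plausible approach (the record titles, filler-word list, and threshold are my assumptions; the paper's actual linking method may differ):

```python
# Illustrative mention-to-record linking via normalized string similarity.
# Titles, stop words, and threshold are assumptions, not the paper's method.
import difflib
import re

def normalize(text):
    """Lowercase, strip punctuation, and drop filler words."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in {"the", "a", "dataset"})

def similarity(a, b):
    """1.0 if one normalized name contains the other, else a character ratio."""
    if a and (a in b or b in a):
        return 1.0
    return difflib.SequenceMatcher(None, a, b).ratio()

def link(mention, titles, threshold=0.8):
    """Return the best-matching record title, or None if nothing scores
    above the threshold."""
    norm = normalize(mention)
    best = max(titles, key=lambda t: similarity(norm, normalize(t)))
    return best if similarity(norm, normalize(best)) >= threshold else None

records = ["COCO: Common Objects in Context", "ImageNet",
           "MNIST handwritten digits"]
print(link("the COCO dataset", records))
# → 'COCO: Common Objects in Context'
```

The threshold matters: set it too low and "the COCO dataset" starts matching unrelated records; too high and legitimately reworded mentions fall through.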

The Big Picture

NERdME is a bridge. It connects the formal world of scientific papers with the messy, practical world of software code.

By teaching computers to understand the "sticky notes" on code repositories, we can:

  • Automatically find the tools scientists used.
  • Link papers to the actual data and code they rely on.
  • Make scientific research more reproducible (easier for others to repeat the experiment) because the "how-to" instructions are finally being understood by machines.

In short, NERdME is the dictionary that finally helps computers understand the messy, real-world instructions scientists leave behind, turning a pile of chaotic sticky notes into a well-organized library of research tools.