Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM

This paper introduces Dripper, a lightweight framework that reformulates web main content extraction as a constrained sequence labeling task using Small Language Models, achieving superior efficiency and accuracy compared to both traditional heuristics and massive generative LLMs while enabling the creation of high-quality training corpora.

Mengjie Liu, Jiahui Peng, Wenchang Ning, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He

Published 2026-03-05

Imagine the internet is a massive, chaotic library. Every day, billions of new books (web pages) are added. But here's the problem: most of these "books" are wrapped in layers of sticky tape, plastic wrapping, and cardboard boxes (ads, pop-ups, navigation menus, and code) that make it impossible to read the actual story inside.

To train smart AI, we need to read the stories, not the packaging. But reading billions of pages one by one is slow, expensive, and often leads to mistakes.

Enter Dripper. Think of Dripper as a super-efficient, high-speed librarian robot that knows exactly how to unwrap these packages without tearing a single page of the story.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Heavy Box" Dilemma

Traditional tools try to read the whole box (the raw HTML code) to find the story. This is like trying to find a specific sentence in a book by reading the entire cardboard box it came in first. It's slow, and the AI gets confused by all the extra stuff.

On the other hand, the "super-smart" AI models (like the giants in the industry) are great at understanding context, but they are like heavyweight champions. They are so expensive to run and so slow that you can't use them to process the entire internet. Plus, they sometimes "hallucinate"—they might make up a story that wasn't in the book at all.

2. The Solution: The "Dual-Branch" Strategy

Dripper solves this with a clever two-step trick, like a sneaky spy and a faithful scribe working together.

  • Step A: The Spy (Simplified HTML)
    First, Dripper takes the messy web page and strips it down to its bare bones. It removes the styling, the fancy fonts, and the hidden code, leaving only the skeleton of the page. It's like taking a complex machine and stripping off the paint and the casing to just look at the gears. This "skeleton" is tiny and easy to read.

    • The Result: A small, lightweight AI (called Dripper-0.6B) looks at this skeleton and quickly decides: "Is this part of the story? Yes or No?" It doesn't rewrite the story; it just puts a sticky note on every block saying "KEEP" or "TRASH."
  • Step B: The Scribe (Mapping HTML)
    While the Spy is making those quick decisions, the Scribe is holding the original, pristine version of the page. Once the Spy says, "Keep block #2," the Scribe grabs the exact original text from block #2.

    • The Result: Because the Scribe never rewrote anything, the final story is perfect. No typos, no missing words, and no made-up facts.
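For readers who like to see the idea in code, the Spy-and-Scribe split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Block`, `simplify`, `toy_labeler`, and `extract_main` are made up for this example, the labels "main" and "other" are placeholders, and the word-count rule merely stands in for the Dripper-0.6B model, which is not reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    item_id: int
    original_html: str  # untouched markup, held by the "Scribe"
    simplified: str     # stripped-down view, shown to the "Spy"


def simplify(html):
    """Drop tag attributes and collapse whitespace (a crude stand-in skeleton)."""
    no_attrs = re.sub(r"<(\w+)[^>]*>", r"<\1>", html)
    return re.sub(r"\s+", " ", no_attrs).strip()


def toy_labeler(block):
    """Hypothetical stand-in for the small LM: label a block 'main' or 'other'."""
    text = re.sub(r"<[^>]+>", "", block.simplified).strip()
    return "main" if len(text.split()) >= 8 else "other"


def extract_main(raw_blocks, labeler=toy_labeler):
    """Label the simplified view, then copy the kept blocks from the originals."""
    blocks = [Block(i, html, simplify(html))
              for i, html in enumerate(raw_blocks, start=1)]
    labels = {b.item_id: labeler(b) for b in blocks}  # Spy: one label per block
    return [b.original_html for b in blocks           # Scribe: verbatim copy
            if labels[b.item_id] == "main"]


page = [
    '<nav class="top"><a href="/">Home</a> <a href="/about">About</a></nav>',
    '<p style="color:#333">Dripper labels each block of a page instead of rewriting its text.</p>',
    '<div class="ad">Buy now!</div>',
]
print(extract_main(page))
```

Note that the block that comes back still carries its original `style` attribute. The labeler only ever saw the skeleton, while the output is copied from the untouched original, so nothing in the kept text can be rewritten along the way.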

3. Why It's a Game Changer

  • Speed: Because the "Spy" only looks at a tiny skeleton, it can process 3 pages every second on a single computer chip. At that rate, one chip gets through roughly 260,000 pages a day (3 pages × 86,400 seconds).
  • Accuracy: By using a "Yes/No" labeling system instead of asking the AI to "write the story," Dripper stops the AI from making things up (hallucinations). It's like asking a guard to point at the right door rather than asking the guard to describe the room behind it.
  • The "Magic" Benchmark: The team built a new test called WebMainBench (like a final exam for web scrapers). Dripper scored higher than the old, slow tools and came very close to the massive, expensive "super-AI" models, but at a fraction of the cost.
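The "point at the door" idea from the Accuracy bullet can be illustrated with a tiny sketch. It assumes the model assigns a score to each possible next token and that the decoder may only choose from a fixed set of labels. The scores, the label names, and the functions `unconstrained_pick` and `constrained_pick` are all hypothetical, and the paper's actual decoding setup may differ.

```python
import math

LABELS = ("main", "other")  # hypothetical label names, not taken from the paper


def unconstrained_pick(scores):
    """Free generation: take the highest-scoring token, whatever it is."""
    return max(scores, key=scores.get)


def constrained_pick(scores, allowed=LABELS):
    """Labeling: take the highest-scoring token among the allowed labels only."""
    return max(allowed, key=lambda tok: scores.get(tok, -math.inf))


# Made-up scores for a single block; "Once" is a free-text token.
scores = {"Once": 3.5, "main": 1.2, "other": 0.4}

print(unconstrained_pick(scores))  # may start writing new text
print(constrained_pick(scores))    # can only ever be one of LABELS
```

Whatever the scores look like, the constrained version can only return one of the allowed labels, so there is no route by which invented text could enter the output.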

4. The Big Picture: Why Does This Matter?

Think of AI training data as the food for a giant robot brain. If the food is full of plastic wrappers (ads and junk code), the robot gets sick and learns bad habits. If the food is clean and nutritious (pure text), the robot gets smarter.

The authors report that when they used Dripper to clean the data, the resulting AI models actually learned better. They became stronger at reasoning and answering questions than models trained on data cleaned by older methods.

Summary Analogy

Imagine you are trying to copy a recipe from a magazine, but the magazine is full of ads, celebrity gossip, and colorful pictures.

  • Old Tools: Try to copy the whole page, getting confused by the ads.
  • Big AI: Reads the whole page perfectly but takes 10 minutes to copy one recipe and costs $100.
  • Dripper: Quickly scans the page, puts a green dot on the recipe and a red dot on the ads (taking 1 second), and then you just copy the parts with the green dots. It's fast, cheap, and the recipe comes out perfect.

Dripper is the tool that makes the internet readable for the next generation of AI, ensuring they learn from the best parts of human knowledge without the noise.