DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

This paper introduces DocSplit, the first comprehensive benchmark dataset and evaluation framework designed to assess and improve the ability of multimodal large language models to recognize and split complex, heterogeneous document packets into individual units.

Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain, Spencer Romo, Bob Strahan, Boyi Xie, Diego A. Socolinsky

Published 2026-02-19

Imagine you walk into a library and find a giant, chaotic pile of papers on a table. This isn't just one book; it's a "packet" containing a mortgage application, a medical report, a tax form, and a legal contract. But here's the catch: someone didn't just drop them in a stack. They took every single page from every document, threw them all into a blender, and dumped the result back onto the table.

Some pages are in the wrong order. Some pages from the medical report are sandwiched between pages of the tax form. Some pages are missing, and some are duplicated.

Your job? To look at this mess, figure out which pages belong together, sort them into their original documents, and put them in the right order.

This is exactly the problem DocSplit solves.

Here is a simple breakdown of the paper, using everyday analogies:

1. The Problem: The "Blender" Effect

In the real world (like at a bank, a hospital, or a law firm), people often send in "document packets." These are bundles of different papers mixed together. Sometimes, a human scanner accidentally shuffles them, or a machine jams and mixes up the order.

Current AI tools are great at reading one page at a time. They can tell you, "This is an invoice" or "This is a letter." But they are terrible at looking at a whole messy pile and saying, "Okay, pages 1, 4, and 9 belong to the first invoice, and pages 2, 5, and 10 belong to the second invoice."

The paper calls this Document Packet Splitting. It's like trying to un-mix a smoothie back into whole fruits.

2. The Solution: The "DocSplit" Benchmark

The authors (from Amazon Web Services) realized that to fix this, we need a better way to test AI. You can't just ask an AI to "fix this mess" and hope for the best. You need a standardized test.

They created DocSplit, which is like a "Driver's Ed" course for AI, but for sorting documents.

  • The Test Track: They built five different "courses" (dataset configurations) of increasing difficulty.
    • Easy: A pile of just one type of document (e.g., 100 pages of invoices) that are just out of order.
    • Medium: A mix of different documents (invoices, letters, resumes) that are mostly in order but have a few swaps.
    • Hard: A total blender scenario where pages from 5 different documents are completely shuffled, interleaved, and mixed up.
  • The Goal: The AI has to do three things:
    1. Group: "These 5 pages go together."
    2. Classify: "This group is a Medical Report."
    3. Order: "Page 1 comes before Page 2."
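The three subtasks above can be sketched as a simple input/output contract. This is our illustrative sketch, not the paper's actual data format: the page indices, document types, and field names are hypothetical.

```python
# Hypothetical sketch of the packet-splitting task.
# Input: a shuffled packet of page IDs, as they came out of the scanner.
shuffled_packet = [3, 7, 1, 9, 4, 2, 8, 5, 6, 0]

# A model's answer must do all three things at once:
# group pages, classify each group, and order the pages within it.
prediction = [
    {"type": "invoice",        "pages": [0, 3, 7]},
    {"type": "medical_report", "pages": [1, 4, 9]},
    {"type": "tax_form",       "pages": [2, 5, 6, 8]},
]

# Sanity check: every scanned page appears in exactly one group.
covered = sorted(p for doc in prediction for p in doc["pages"])
assert covered == sorted(shuffled_packet)
print(covered)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Framing the output this way makes clear why grading is tricky: a model can get the grouping right while fumbling the order inside a group, or vice versa, which is exactly what the paper's partial-credit metrics are built to capture.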

3. The New Scorecard: How We Grade the AI

Before this paper, grading an AI was like a strict teacher who only gave you an "A" or an "F."

  • The Old Way: If the AI got the grouping right but the order wrong, it got an F. If it got the order right but the grouping wrong, it got an F. It was all or nothing.
  • The DocSplit Way: The authors created a new "Report Card" (metrics) that gives partial credit.
    • Imagine you are sorting a deck of cards. If you get all the suits separated (Clustering) but the cards within the suits are slightly mixed up (Ordering), the old system says "Fail." The new DocSplit system says, "Great job on the suits! You got a B+."
    • They use math (like a "Kendall's Tau" score) to measure how mixed up the order is, rather than just checking if it's perfect.
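To make the "partial credit" idea concrete, here is a minimal, self-contained sketch of a Kendall's-tau-style ordering score. It is our illustration of the general statistic, not the paper's exact evaluation code, and it assumes no tied or missing pages.

```python
def kendall_tau(predicted, truth):
    """Kendall's tau between a predicted page order and the true order.

    +1.0 means a perfect order, -1.0 fully reversed; values in
    between award partial credit for a mostly-right ordering.
    """
    # Rank of each page in the true order.
    rank = {page: i for i, page in enumerate(truth)}
    n = len(predicted)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Does this pair of pages appear in the same relative
            # order as in the ground truth?
            if rank[predicted[i]] < rank[predicted[j]]:
                concordant += 1
            else:
                discordant += 1
    total = n * (n - 1) // 2
    return (concordant - discordant) / total

# One swapped pair in five pages scores 0.8 instead of a flat "fail":
print(kendall_tau([1, 2, 4, 3, 5], [1, 2, 3, 4, 5]))  # → 0.8
```

The key design point is that the score degrades smoothly: each out-of-order pair of pages costs a little credit, so "almost sorted" is rewarded far more than "random", which the old exact-match grading could not express.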

4. The Results: AI is Good at Sorting, Bad at Grouping

The authors tested several powerful AI models (like Claude, Qwen, and DeepSeek) on these new courses.

  • The Good News: The AIs are surprisingly good at figuring out the order of pages once they know which pages belong together. If you tell them "These 5 pages are a letter," they can usually put them in the right sequence.
  • The Bad News: They struggle with the grouping. When pages from different documents are mixed together, the AI often gets confused. It might think a page from a "Resume" belongs to the "Invoice" right next to it.
  • The Takeaway: Current AI is like a librarian who can read a book perfectly but gets confused when two different books are glued together. We need better "librarians" that can see the glue lines.

5. Why Does This Matter?

This isn't just an academic game. Think about:

  • Healthcare: A patient's file has their lab results, insurance forms, and doctor's notes all mixed up. If the AI can't sort them, the doctor might miss a critical diagnosis.
  • Banking: A loan application has 50 pages of documents from different sources. If the bank can't separate them, the loan gets rejected or delayed.
  • Law: A legal case involves thousands of pages of evidence. If the AI can't split the packets, lawyers waste weeks manually sorting paper.

Summary

DocSplit is the first major "stress test" for AI to see if it can untangle a messy pile of mixed-up documents. It provides the test questions, the grading rubric, and the results. The paper shows that while AI is getting smarter, it still has a long way to go before it can reliably act as a digital filing clerk for the real world.

The authors have released all their data and tools for free, inviting other researchers to help build the next generation of "document sorting" AI.
