DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

This paper introduces DocSplit, the first comprehensive benchmark dataset and evaluation framework designed to assess and improve the ability of multimodal large language models to recognize and split complex, heterogeneous document packets into individual units.

Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain, Spencer Romo, Bob Strahan, Boyi Xie, Diego A. Socolinsky

Published 2026-02-19

Imagine you walk into a library and find a giant, chaotic pile of papers on a table. This isn't just one book; it's a "packet" containing a mortgage application, a medical report, a tax form, and a legal contract. But here's the catch: someone didn't just drop them in a stack. They took every single page from every document, threw them all into a blender, and dumped the result back onto the table.

Some pages are in the wrong order. Some pages from the medical report are sandwiched between pages of the tax form. Some pages are missing, and some are duplicated.

Your job? To look at this mess, figure out which pages belong together, sort them into their original documents, and put them in the right order.

This is exactly the problem DocSplit solves.

Here is a simple breakdown of the paper, using everyday analogies:

1. The Problem: The "Blender" Effect

In the real world (like at a bank, a hospital, or a law firm), people often send in "document packets." These are bundles of different papers mixed together. Sometimes, a human scanner accidentally shuffles them, or a machine jams and mixes up the order.

Current AI tools are great at reading one page at a time. They can tell you, "This is an invoice" or "This is a letter." But they are terrible at looking at a whole messy pile and saying, "Okay, pages 1, 4, and 9 belong to the first invoice, and pages 2, 5, and 10 belong to the second invoice."

The paper calls this Document Packet Splitting. It's like trying to un-mix a smoothie back into whole fruits.

2. The Solution: The "DocSplit" Benchmark

The authors (from Amazon Web Services) realized that to fix this, we need a better way to test AI. You can't just ask an AI to "fix this mess" and hope for the best. You need a standardized test.

They created DocSplit, which is like a "Driver's Ed" course for AI, but for sorting documents.

  • The Test Track: They built five different "courses" (dataset configurations) of increasing difficulty.
    • Easy: A pile of just one type of document (e.g., 100 pages of invoices) that are just out of order.
    • Medium: A mix of different documents (invoices, letters, resumes) that are mostly in order but have a few swaps.
    • Hard: A total blender scenario where pages from 5 different documents are completely shuffled, interleaved, and mixed up.
  • The Goal: The AI has to do three things:
    1. Group: "These 5 pages go together."
    2. Classify: "This group is a Medical Report."
    3. Order: "Page 1 comes before Page 2."
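The three subtasks above can be sketched as a simple input/output contract. This is our illustrative sketch, not the paper's actual data format: the page indices, document types, and field names are hypothetical.

```python
# Hypothetical sketch of the packet-splitting task.
# Input: a shuffled packet of page IDs, as they came out of the scanner.
shuffled_packet = [3, 7, 1, 9, 4, 2, 8, 5, 6, 0]

# A model's answer must do all three things at once:
# group pages, classify each group, and order the pages within it.
prediction = [
    {"type": "invoice",        "pages": [0, 3, 7]},
    {"type": "medical_report", "pages": [1, 4, 9]},
    {"type": "tax_form",       "pages": [2, 5, 6, 8]},
]

# Sanity check: every scanned page appears in exactly one group.
covered = sorted(p for doc in prediction for p in doc["pages"])
assert covered == sorted(shuffled_packet)
print(covered)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Framing the output this way makes clear why grading is tricky: a model can get the grouping right while fumbling the order inside a group, or vice versa, which is exactly what the paper's partial-credit metrics are built to capture.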

3. The New Scorecard: How We Grade the AI

Before this paper, grading an AI was like a strict teacher who only gave you an "A" or an "F."

  • The Old Way: If the AI got the grouping right but the order wrong, it got an F. If it got the order right but the grouping wrong, it got an F. It was all or nothing.
  • The DocSplit Way: The authors created a new "Report Card" (metrics) that gives partial credit.
    • Imagine you are sorting a deck of cards. If you get all the suits separated (Clustering) but the cards within the suits are slightly mixed up (Ordering), the old system says "Fail." The new DocSplit system says, "Great job on the suits! You got a B+."
    • They use math (like a "Kendall's Tau" score) to measure how mixed up the order is, rather than just checking if it's perfect.
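To make the "partial credit" idea concrete, here is a minimal, self-contained sketch of a Kendall's-tau-style ordering score. It is our illustration of the general statistic, not the paper's exact evaluation code, and it assumes no tied or missing pages.

```python
def kendall_tau(predicted, truth):
    """Kendall's tau between a predicted page order and the true order.

    +1.0 means a perfect order, -1.0 fully reversed; values in
    between award partial credit for a mostly-right ordering.
    """
    # Rank of each page in the true order.
    rank = {page: i for i, page in enumerate(truth)}
    n = len(predicted)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Does this pair of pages appear in the same relative
            # order as in the ground truth?
            if rank[predicted[i]] < rank[predicted[j]]:
                concordant += 1
            else:
                discordant += 1
    total = n * (n - 1) // 2
    return (concordant - discordant) / total

# One swapped pair in five pages scores 0.8 instead of a flat "fail":
print(kendall_tau([1, 2, 4, 3, 5], [1, 2, 3, 4, 5]))  # → 0.8
```

The key design point is that the score degrades smoothly: each out-of-order pair of pages costs a little credit, so "almost sorted" is rewarded far more than "random", which the old exact-match grading could not express.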

4. The Results: AI is Good at Sorting, Bad at Grouping

The authors tested several powerful AI models (like Claude, Qwen, and DeepSeek) on these new courses.

  • The Good News: The AIs are surprisingly good at figuring out the order of pages once they know which pages belong together. If you tell them "These 5 pages are a letter," they can usually put them in the right sequence.
  • The Bad News: They struggle with the grouping. When pages from different documents are mixed together, the AI often gets confused. It might think a page from a "Resume" belongs to the "Invoice" right next to it.
  • The Takeaway: Current AI is like a librarian who can read a book perfectly but gets confused when two different books are glued together. We need better "librarians" that can see the glue lines.

5. Why Does This Matter?

This isn't just an academic game. Think about:

  • Healthcare: A patient's file has their lab results, insurance forms, and doctor's notes all mixed up. If the AI can't sort them, the doctor might miss a critical diagnosis.
  • Banking: A loan application has 50 pages of documents from different sources. If the bank can't separate them, the loan gets rejected or delayed.
  • Law: A legal case involves thousands of pages of evidence. If the AI can't split the packets, lawyers waste weeks manually sorting paper.

Summary

DocSplit is the first major "stress test" for AI to see if it can untangle a messy pile of mixed-up documents. It provides the test questions, the grading rubric, and the results. The paper shows that while AI is getting smarter, it still has a long way to go before it can reliably act as a digital filing clerk for the real world.

The authors have released all their data and tools for free, inviting other researchers to help build the next generation of "document sorting" AI.
