Learning Page Order in Shuffled WOO Releases

This paper investigates document page reordering in heterogeneous Dutch freedom-of-information (WOO) releases. While specialized models achieve high accuracy on short documents, seq2seq transformers fail to generalize to longer texts, because short and long documents demand fundamentally different ordering strategies. The authors address this challenge through model specialization rather than curriculum learning.

Efe Kahraman, Giulio Tosato

Published 2026-03-10

Imagine you have a massive, chaotic pile of paper documents. Some are emails, some are legal contracts, some are spreadsheets, and some are scanned notes. A government agency (in the Netherlands) has dumped all these pages into a single PDF file, but they've been completely shuffled. The pages are mixed up like a deck of cards that's been thrown in the air.

Your job? Put the deck back in order.

This is exactly what the researchers in this paper tried to do. They took 5,461 of these "shuffled decks" (called WOO documents) and asked: Can a computer figure out the correct original order of the pages just by reading the text on them?

Here is the breakdown of their journey, explained with some everyday analogies.

1. The Problem: A Messy Puzzle with No Picture

Usually, when you try to order pages, you look for clues: "Page 1 says 'Dear Sir,' and Page 2 says 'Sincerely.'" But these government documents are a weird mix.

  • The Analogy: Imagine a puzzle where the pieces aren't from a picture of a cat, but a random mix of a map, a grocery list, a love letter, and a tax form. The end of one page might be a legal signature, and the very next page (in the real world) might be a completely unrelated email about lunch.
  • The Challenge: Because the content jumps around so wildly, there are no obvious "clues" to tell the computer which page comes next. It's like trying to solve a puzzle where the pieces don't actually fit together visually.

2. The Contenders: Five Different Strategies

The researchers tested five different "AI detectives" to see who could solve the puzzle best.

  • The "Guess and Go" (Heuristics): These are simple rules. "Pick a page, then find the page that looks most similar to it."
    • Result: Terrible. Since the pages are so different, looking for "similar" pages is like trying to find your next step in a maze by looking for a step that looks like the last one. It doesn't work.
  • The "List Maker" (BiLSTM): This model looks at all the pages at once and gives each one a score, like a teacher grading a test. "This page feels like it belongs in spot #3."
    • Result: Decent for short documents, but it gets confused as the pile gets bigger.
  • The "Line-Up" (Pointer Networks): This model acts like a game show host. It picks one page, puts it in the line, then picks the next one from the remaining pile, and so on. It builds the order step-by-step.
    • Result: Good, but it starts to stumble when the line gets too long.
  • The "Translator" (Seq2Seq Transformers): This is a very powerful, modern AI model. It tries to look at the whole shuffled pile and "translate" it into an ordered list, one page at a time.
    • Result: The Big Failure. It worked amazingly well for short documents (2–5 pages), but when the pile got big (20+ pages), it completely crashed. It went from being a genius to being worse than random guessing.
  • The "Matchmaker" (Pairwise Ranking): Instead of trying to build the whole line at once, this model asks a simple question for every possible pair of pages: "Does Page A come before Page B?" It does this for every single pair, then adds up the votes to build the final order.
    • Result: The Winner. It was the most consistent and accurate, especially for longer documents.
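The "matchmaker" idea is easy to sketch in code. Here's a minimal, hypothetical Python version: `comes_before` stands in for the trained pairwise model, and the vote-counting step is a simple tally of how many pairs each page "wins" (the paper's exact aggregation may differ).

```python
def order_by_votes(pages, comes_before):
    """Order pages by counting pairwise 'A before B' votes.

    comes_before(a, b) -> True if the model predicts page a precedes page b.
    Each page's score is the number of pairs it wins; sorting by score
    (highest first) yields the predicted order.
    """
    scores = {p: 0 for p in pages}
    for a in pages:
        for b in pages:
            if a != b and comes_before(a, b):
                scores[a] += 1
    return sorted(pages, key=lambda p: -scores[p])

# Toy stand-in for the trained model: pages tagged with their true index.
pages = [("p3", 3), ("p1", 1), ("p4", 4), ("p2", 2)]
predicted = order_by_votes(pages, lambda a, b: a[1] < b[1])
print([name for name, _ in predicted])  # with a perfect comparator: ['p1', 'p2', 'p3', 'p4']
```

Note the trade-off: this asks the model one simple question at a time, which is why it stays stable on long documents, but the number of pairs grows quadratically with page count.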

3. The Big Surprises

Surprise #1: The "Translator" Crashed on Long Documents

The "Translator" model (Seq2Seq) was great for short stories but failed miserably on long novels.

  • The Analogy: Imagine a student who memorizes the answers to a 5-question quiz perfectly. But when you give them a 25-question quiz, they panic and start guessing randomly.
  • Why? The researchers found that the model was relying too much on "position tags" (like labels saying "I am Page 1," "I am Page 2"). When the document got longer than what it saw in training, it got lost. Even when they tried to fix the labels, the model still failed, suggesting the whole "step-by-step" approach just doesn't work for these messy, long documents.
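To see why "position tags" break down, here is a toy Python sketch (not the paper's actual architecture) of learned absolute position embeddings: the lookup table only covers positions seen during training, so pages past that length simply have no learned representation.

```python
import random

random.seed(0)
max_train_len = 5  # longest document length seen during training
# Learned absolute position vectors: one 8-dim vector per training position.
pos_table = [[random.gauss(0, 1) for _ in range(8)] for _ in range(max_train_len)]

def position_vector(i):
    """Look up a position embedding; positions past the training length are undefined."""
    return pos_table[i] if i < max_train_len else None

# Positions 0-4 have learned vectors; position 7 does not, so a model that
# leans on "I am page N" tags has no reliable signal past page 5.
print(position_vector(3) is not None)  # → True
print(position_vector(7) is None)      # → True
```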

Surprise #2: "Baby Steps" Didn't Help (Curriculum Learning)

In education, we often teach kids simple things first (addition) before hard things (calculus). This is called "Curriculum Learning." The researchers tried teaching the AI to order short documents first, then gradually moving to longer ones.

  • The Result: It actually made things worse (39% worse on long documents).
  • The Analogy: It's like teaching a driver to park in an empty parking lot, then expecting them to drive a Formula 1 race car immediately after. The skills are different!
  • Why? The AI learned that short documents need to look at nearby pages to find order. But long documents need to look at the whole picture to find order. By forcing the AI to learn the "short way" first, it got stuck in a bad habit and couldn't switch to the "long way" later.
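A curriculum schedule like the one described above can be sketched in a few lines of Python. The stage caps below are illustrative, not taken from the paper: each stage admits every document up to a page-count cap, so later stages revisit short documents while adding longer ones.

```python
def curriculum_schedule(docs, stage_caps=(5, 10, 15, 20, 25)):
    """Group documents into training stages by page count, shortest first.

    Stage k contains every document with at most stage_caps[k] pages,
    so the training data grows from easy (short) to hard (long).
    """
    return [[d for d in docs if len(d) <= cap] for cap in stage_caps]

docs = [list(range(n)) for n in (3, 8, 14, 22)]  # fake docs of 3/8/14/22 pages
stages = curriculum_schedule(docs)
print([len(s) for s in stages])  # → [1, 2, 3, 3, 4]
```

The paper's finding is that this schedule backfires here: the early, short-document stages teach a local-clue strategy that the model then cannot unlearn for long documents.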

4. The Winning Strategy: Specialized Teams

The best solution wasn't one giant AI trying to do everything. Instead, they built five specialized teams.

  • One team only handled 2–5 page documents.
  • Another team only handled 21–25 page documents.
  • The Analogy: Instead of hiring one general contractor to fix a leaky faucet and build a skyscraper, you hire a plumber for the faucet and a structural engineer for the skyscraper.
  • The Result: This approach worked incredibly well. For documents up to 15 pages, they got the order right almost 95% of the time. Even for the massive 25-page documents, they improved the accuracy significantly compared to the "one-size-fits-all" model.
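The "specialized teams" setup amounts to routing each document to a model by its page count. Here is a minimal Python sketch; the bucket boundaries and model names are assumptions based on the ranges mentioned above, not the paper's exact configuration.

```python
def route_to_specialist(doc, specialists):
    """Pick the specialist whose page-count bucket contains this document.

    specialists maps (lo, hi) page ranges (inclusive) to ordering models.
    """
    n = len(doc)
    for (lo, hi), model in specialists.items():
        if lo <= n <= hi:
            return model
    raise ValueError(f"no specialist for a {n}-page document")

# Hypothetical buckets covering 2-25 pages in steps of five.
specialists = {
    (2, 5): "model_short",
    (6, 10): "model_6_10",
    (11, 15): "model_11_15",
    (16, 20): "model_16_20",
    (21, 25): "model_long",
}
print(route_to_specialist(["page"] * 4, specialists))   # → model_short
print(route_to_specialist(["page"] * 23, specialists))  # → model_long
```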

The Bottom Line

This paper teaches us that one size does not fit all in AI.

  1. Messy data is hard: When documents are a random mix of emails and spreadsheets, simple clues don't work.
  2. Step-by-step fails for long tasks: Trying to build a long list one item at a time (like the "Translator" model) causes the AI to lose its way.
  3. Specialization wins: Breaking the problem down and giving the AI specific tools for specific lengths of documents is the key to success.

The researchers have made their code and data public, so anyone can try to solve this "shuffled government document" puzzle themselves!