Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

This paper presents an empirical study of PDF parsers and chunking strategies in Retrieval-Augmented Generation (RAG) systems for financial question answering. It introduces a new table-focused benchmark, TableQuest, and distills the results into practical guidelines for building robust document understanding pipelines.

Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

Published 2026-04-15

Imagine you are a financial analyst trying to find a specific number in a massive, 200-page PDF report about a company's earnings. The problem? That PDF is designed for humans to read, not for computers to understand. It's like a beautifully decorated book where the text, tables, and charts are all mixed together without clear labels. If you ask a computer to "find the profit margin," it might get confused, looking at the wrong page or mixing up numbers from different tables.

This paper is essentially a scientific taste test to figure out the best way to teach computers how to read these messy financial PDFs and answer questions accurately.

Here is the breakdown using simple analogies:

1. The Problem: The "Messy Library"

Think of a financial PDF as a library where all the books have been shredded and thrown into a giant pile.

  • The Goal: You want to find a specific sentence (e.g., "What was the revenue in 2023?").
  • The Challenge: Before a computer can find that sentence, it has to:
    1. Reassemble the pages (Parsing): Figure out which text belongs to a table and which belongs to a paragraph.
    2. Cut the pages into bite-sized pieces (Chunking): Computers can't read a whole 200-page book at once; they have a "short attention span" (a limited context window). So you have to cut the text into small chunks.
    3. Find the right piece (Retrieval): When you ask a question, the system needs to grab the right chunk immediately.
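The three steps above can be sketched end-to-end in a few lines. This is a deliberately toy illustration, not the paper's pipeline: fixed-size word chunking plus a naive word-overlap retriever standing in for a real parser and embedding-based search.

```python
# Toy RAG retrieval: chunk a document, then fetch the chunk that best
# matches a question by simple word overlap.
# (A real system would use a PDF parser and an embedding retriever.)

def chunk(text, size=20, overlap=5):
    """Split text into overlapping chunks of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def retrieve(question, chunks):
    """Return the chunk sharing the most words with the question."""
    q = set(question.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

report = (
    "The company reported strong growth. Total revenue in 2023 "
    "was 4.2 billion dollars, up from 3.1 billion in 2022. "
    "Operating costs rose modestly over the same period."
)
chunks = chunk(report, size=12, overlap=3)
best = retrieve("What was the revenue in 2023?", chunks)
print(best)  # the chunk containing the 2023 revenue figure
```

Swapping the word-overlap scorer for vector similarity over embeddings gives the "search engine" variants the paper compares.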

2. The Experiment: The "Cooking Competition"

The researchers set up a massive experiment to see which "chefs" (tools) and "cutting styles" (strategies) work best. They tested this on two types of "recipes":

  • FinanceBench: Questions based on text (like reading a story).
  • TableQuest (New!): Questions based on tables (like reading a spreadsheet). This is a big deal because most previous tests ignored tables, which is where the real money data usually hides.

They tested:

  • 6 Different "Scissors" (PDF Parsers): Some are basic scissors (just cut text), while others are high-tech laser cutters that can see tables and images.
  • 6 Different "Cutting Styles" (Chunking): Some cut every 50 words, some cut at every sentence, and some use AI to cut only where the meaning changes.
  • Different "Search Engines" (Retrievers): How the computer looks for the right piece.
  • Different "Brains" (LLMs): The AI that actually reads the piece and answers the question.

3. The Key Findings (The "Secret Sauce")

🏆 The Best Scissors (Parsing)

  • For Text: A tool called pdfminer was the champion. It's like a careful librarian who reads the raw text stream very accurately.
  • For Tables: A tool called pdfplumber won. It's like a specialist who knows exactly how to separate a spreadsheet from the surrounding text.
  • The "All-in-One" Trap: The fancy tool that uses heavy AI (Unstructured) was very accurate but extremely slow. It's like using a supercomputer to chop a single onion—it works, but it takes forever. For business, speed matters.
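As a concrete illustration of the two winning tools, the sketch below pulls raw text with pdfminer.six and tables with pdfplumber. The file path is a placeholder and the imports are guarded, so this is a hypothetical usage sketch rather than the paper's actual extraction code.

```python
# Hypothetical usage of the two winning parsers (path is a placeholder).
try:
    from pdfminer.high_level import extract_text   # champion for body text
    import pdfplumber                              # champion for tables
except ImportError:                                # libraries not installed
    extract_text = pdfplumber = None

def parse_report(path):
    """Return (raw_text, tables) for a financial PDF, if parsers exist."""
    if extract_text is None or pdfplumber is None:
        return None, None
    text = extract_text(path)                      # full raw text stream
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())   # each table: list of rows
    return text, tables
```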

✂️ The Best Cutting Style (Chunking)

  • Don't cut too finely: If you cut the text into tiny, random pieces, you lose the context (like cutting a sentence in half).
  • The "Overlap" Trick: The researchers found that when cutting the text, you should let the pieces overlap slightly (like overlapping roof tiles). If you cut a page into chunks, make sure the last 25% of one chunk is repeated at the start of the next. This ensures that if a number or a sentence is split by the cut, the computer still sees the full picture.
  • The Winner: A "Neural" cutter (AI that understands meaning) worked best, but a simple "Sentence" cutter was almost as good and much cheaper/faster.
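The runner-up "Sentence" cutter, combined with the overlap trick, fits in a few lines of stdlib Python. The sentence splitter here is deliberately naive (it breaks on ., ! or ? followed by whitespace); the paper's actual chunkers are more sophisticated.

```python
import re

def sentence_chunks(text, sentences_per_chunk=3, overlap=1):
    """Group sentences into chunks, repeating `overlap` sentences
    between consecutive chunks so no fact is lost at a boundary."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    step = sentences_per_chunk - overlap
    return [
        " ".join(sents[i:i + sentences_per_chunk])
        for i in range(0, max(len(sents) - overlap, 1), step)
    ]

doc = ("Revenue grew 12% in 2023. Margins held steady at 18%. "
       "Costs rose 4%. Guidance for 2024 is unchanged. "
       "The dividend was raised by 5%.")
for c in sentence_chunks(doc):
    print(c)
```

Note how the sentence "Costs rose 4%." appears at the end of the first chunk and the start of the second: that repetition is exactly the overlapping-roof-tiles effect.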

🧠 The Best Brain (The AI Model)

  • Bigger is better (but with limits): Bigger AI models (like GPT-5 or large open-source models) gave much better answers than small ones.
  • The Sweet Spot: You don't need the biggest, most expensive model to get 90% of the way there. A medium-sized model often gives the best balance of cost and accuracy.

4. The "TableQuest" Discovery

The researchers realized that previous tests were like testing a car only on smooth highways (text). They built TableQuest to test the car on off-road terrain (tables).

  • Result: Many systems that were great at reading text failed miserably at reading tables. They couldn't figure out which row and column a number belonged to.
  • Lesson: If you want to build a financial AI, you must test it on tables, or it will fail in the real world.
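One common remedy for exactly this failure mode (shown here as a generic sketch, not the paper's method) is to linearize each table cell together with its row and column headers before chunking, so a retrieved snippet still says which row and column a number belongs to.

```python
def linearize_table(headers, rows):
    """Turn a table into lines that keep row/column context,
    e.g. 'Revenue, 2023: 4.2' instead of a bare '4.2'."""
    lines = []
    for row in rows:
        label, *values = row            # first cell names the row
        for col, val in zip(headers[1:], values):
            lines.append(f"{label}, {col}: {val}")
    return lines

headers = ["Metric", "2022", "2023"]
rows = [["Revenue", "3.1", "4.2"],
        ["Net income", "0.4", "0.6"]]
for line in linearize_table(headers, rows):
    print(line)
# Revenue, 2022: 3.1
# Revenue, 2023: 4.2
# Net income, 2022: 0.4
# Net income, 2023: 0.6
```

After linearization, a question like "What was revenue in 2023?" can match a chunk that explicitly contains the row and column labels, instead of a context-free cell value.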

5. The Final Takeaway for Business

If you are a bank or a company trying to build an AI to read financial reports:

  1. Don't overcomplicate it: You don't need the most expensive, slowest tools. A combination of pdfplumber (for tables) and a 25% overlap when cutting text works wonders.
  2. Overlap is key: Always let your text chunks overlap slightly so you don't lose context.
  3. Test on tables: Don't just test if your AI can read paragraphs; test if it can read spreadsheets.

In a nutshell: This paper is a guidebook that says, "Here is the exact recipe for building a financial AI that doesn't hallucinate, doesn't get confused by tables, and does it fast enough to be useful in a real bank."
