Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

This paper presents an empirical study of PDF parsers and chunking strategies in Retrieval-Augmented Generation (RAG) systems for financial question answering. It introduces a new table-focused benchmark, TableQuest, and distills the results into practical guidelines for building robust document understanding pipelines.

Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon

Published 2026-04-15

Imagine you are a financial analyst trying to find a specific number in a massive, 200-page PDF report about a company's earnings. The problem? That PDF is designed for humans to read, not for computers to understand. It's like a beautifully decorated book where the text, tables, and charts are all mixed together without clear labels. If you ask a computer to "find the profit margin," it might get confused, looking at the wrong page or mixing up numbers from different tables.

This paper is essentially a scientific taste test to figure out the best way to teach computers how to read these messy financial PDFs and answer questions accurately.

Here is the breakdown using simple analogies:

1. The Problem: The "Messy Library"

Think of a financial PDF as a library where all the books have been shredded and thrown into a giant pile.

  • The Goal: You want to find a specific sentence (e.g., "What was the revenue in 2023?").
  • The Challenge: Before a computer can find that sentence, it has to:
    1. Reassemble the pages (Parsing): Figure out which text belongs to a table and which belongs to a paragraph.
    2. Cut the pages into bite-sized pieces (Chunking): Computers can't read a whole 200-page book at once; they have a "short attention span" (a limited context window). So you have to cut the text into small chunks.
    3. Find the right piece (Retrieval): When you ask a question, the system needs to grab the right chunk immediately.
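The three steps above can be sketched end-to-end in a few lines. This is a deliberately toy illustration, not the paper's pipeline: fixed-size word chunking plus a naive word-overlap retriever standing in for a real parser and embedding-based search.

```python
# Toy RAG retrieval: chunk a document, then fetch the chunk that best
# matches a question by simple word overlap.
# (A real system would use a PDF parser and an embedding retriever.)

def chunk(text, size=20, overlap=5):
    """Split text into overlapping chunks of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def retrieve(question, chunks):
    """Return the chunk sharing the most words with the question."""
    q = set(question.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

report = (
    "The company reported strong growth. Total revenue in 2023 "
    "was 4.2 billion dollars, up from 3.1 billion in 2022. "
    "Operating costs rose modestly over the same period."
)
chunks = chunk(report, size=12, overlap=3)
best = retrieve("What was the revenue in 2023?", chunks)
print(best)  # the chunk containing the 2023 revenue figure
```

Swapping the word-overlap scorer for vector similarity over embeddings gives the "search engine" variants the paper compares.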

2. The Experiment: The "Cooking Competition"

The researchers set up a massive experiment to see which "chefs" (tools) and "cutting styles" (strategies) work best. They tested this on two types of "recipes":

  • FinanceBench: Questions based on text (like reading a story).
  • TableQuest (New!): Questions based on tables (like reading a spreadsheet). This is a big deal because most previous tests ignored tables, which is where the real money data usually hides.

They tested:

  • 6 Different "Scissors" (PDF Parsers): Some are basic scissors (just cut text), while others are high-tech laser cutters that can see tables and images.
  • 6 Different "Cutting Styles" (Chunking): Some cut every 50 words, some cut at every sentence, and some use AI to cut only where the meaning changes.
  • Different "Search Engines" (Retrievers): How the computer looks for the right piece.
  • Different "Brains" (LLMs): The AI that actually reads the piece and answers the question.

3. The Key Findings (The "Secret Sauce")

🏆 The Best Scissors (Parsing)

  • For Text: A tool called pdfminer was the champion. It's like a careful librarian who reads the raw text stream very accurately.
  • For Tables: A tool called pdfplumber won. It's like a specialist who knows exactly how to separate a spreadsheet from the surrounding text.
  • The "All-in-One" Trap: The fancy tool that uses heavy AI (Unstructured) was very accurate but extremely slow. It's like using a supercomputer to chop a single onion—it works, but it takes forever. For business, speed matters.
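As a concrete illustration of the two winning tools, the sketch below pulls raw text with pdfminer.six and tables with pdfplumber. The file path is a placeholder and the imports are guarded, so this is a hypothetical usage sketch rather than the paper's actual extraction code.

```python
# Hypothetical usage of the two winning parsers (path is a placeholder).
try:
    from pdfminer.high_level import extract_text   # champion for body text
    import pdfplumber                              # champion for tables
except ImportError:                                # libraries not installed
    extract_text = pdfplumber = None

def parse_report(path):
    """Return (raw_text, tables) for a financial PDF, if parsers exist."""
    if extract_text is None or pdfplumber is None:
        return None, None
    text = extract_text(path)                      # full raw text stream
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())   # each table: list of rows
    return text, tables
```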

✂️ The Best Cutting Style (Chunking)

  • Don't cut too finely: If you cut the text into tiny, random pieces, you lose the context (like cutting a sentence in half).
  • The "Overlap" Trick: The researchers found that when cutting the text, you should let the pieces overlap slightly (like overlapping roof tiles). If you cut a page into chunks, make sure the last 25% of one chunk is repeated at the start of the next. This ensures that if a number or a sentence is split by the cut, the computer still sees the full picture.
  • The Winner: A "Neural" cutter (AI that understands meaning) worked best, but a simple "Sentence" cutter was almost as good and much cheaper/faster.
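The runner-up "Sentence" cutter, combined with the overlap trick, fits in a few lines of stdlib Python. The sentence splitter here is deliberately naive (it breaks on ., ! or ? followed by whitespace); the paper's actual chunkers are more sophisticated.

```python
import re

def sentence_chunks(text, sentences_per_chunk=3, overlap=1):
    """Group sentences into chunks, repeating `overlap` sentences
    between consecutive chunks so no fact is lost at a boundary."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    step = sentences_per_chunk - overlap
    return [
        " ".join(sents[i:i + sentences_per_chunk])
        for i in range(0, max(len(sents) - overlap, 1), step)
    ]

doc = ("Revenue grew 12% in 2023. Margins held steady at 18%. "
       "Costs rose 4%. Guidance for 2024 is unchanged. "
       "The dividend was raised by 5%.")
for c in sentence_chunks(doc):
    print(c)
```

Note how the sentence "Costs rose 4%." appears at the end of the first chunk and the start of the second: that repetition is exactly the overlapping-roof-tiles effect.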

🧠 The Best Brain (The AI Model)

  • Bigger is better (but with limits): Bigger AI models (like GPT-5 or large open-source models) gave much better answers than small ones.
  • The Sweet Spot: You don't need the biggest, most expensive model to get 90% of the way there. A medium-sized model often gives the best balance of cost and accuracy.

4. The "TableQuest" Discovery

The researchers realized that previous tests were like testing a car only on smooth highways (text). They built TableQuest to test the car on off-road terrain (tables).

  • Result: Many systems that were great at reading text failed miserably at reading tables. They couldn't figure out which row and column a number belonged to.
  • Lesson: If you want to build a financial AI, you must test it on tables, or it will fail in the real world.
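One common remedy for exactly this failure mode (shown here as a generic sketch, not the paper's method) is to linearize each table cell together with its row and column headers before chunking, so a retrieved snippet still says which row and column a number belongs to.

```python
def linearize_table(headers, rows):
    """Turn a table into lines that keep row/column context,
    e.g. 'Revenue, 2023: 4.2' instead of a bare '4.2'."""
    lines = []
    for row in rows:
        label, *values = row            # first cell names the row
        for col, val in zip(headers[1:], values):
            lines.append(f"{label}, {col}: {val}")
    return lines

headers = ["Metric", "2022", "2023"]
rows = [["Revenue", "3.1", "4.2"],
        ["Net income", "0.4", "0.6"]]
for line in linearize_table(headers, rows):
    print(line)
# Revenue, 2022: 3.1
# Revenue, 2023: 4.2
# Net income, 2022: 0.4
# Net income, 2023: 0.6
```

After linearization, a question like "What was revenue in 2023?" can match a chunk that explicitly contains the row and column labels, instead of a context-free cell value.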

5. The Final Takeaway for Business

If you are a bank or a company trying to build an AI to read financial reports:

  1. Don't overcomplicate it: You don't need the most expensive, slowest tools. A combination of pdfplumber (for tables) and a 25% overlap when cutting text works wonders.
  2. Overlap is key: Always let your text chunks overlap slightly so you don't lose context.
  3. Test on tables: Don't just test if your AI can read paragraphs; test if it can read spreadsheets.

In a nutshell: This paper is a guidebook that says, "Here is the exact recipe for building a financial AI that doesn't hallucinate, doesn't get confused by tables, and does it fast enough to be useful in a real bank."
