Imagine you have a massive library of documents. Inside these documents are thousands of tables. Some are simple spreadsheets, but many are Human-Centric Tables (HCTs).
Think of HCTs like hand-painted maps or complex family tree charts designed for a human to read with their eyes. They have merged cells, nested headers, colorful highlights, and "Total" rows that span across different sections. They are beautiful and informative for a person, but they are a nightmare for a computer. A computer usually expects data to be in a strict, grid-like format (like a spreadsheet) in order to answer questions like, "What were the total sales in 2023?"
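To make the contrast concrete, here is a minimal sketch (illustrative only, not from the paper) of why a naive program handles a flat grid easily but stumbles on an HCT: merged header cells become blanks, and a spanning "Total" row mixes raw data with aggregates.

```python
# A flat, machine-friendly table: every row has the same columns,
# so a query is a simple filter-and-sum.
flat = [
    {"region": "North", "year": 2023, "sales": 120},
    {"region": "South", "year": 2023, "sales": 95},
]
total_2023 = sum(r["sales"] for r in flat if r["year"] == 2023)  # 215

# The same data flattened naively from a human-centric table (HCT):
# a merged title cell, a nested header, and a spanning "Total" row.
hct_rows = [
    ["Sales", ""],        # merged title cell spanning the columns
    ["Region", "2023"],   # nested header: the year sits under "Sales"
    ["North", "120"],
    ["South", "95"],
    ["Total", "215"],     # aggregate row mixed in with the data
]
# A naive "sum everything numeric" query now goes wrong:
naive_sum = sum(int(r[1]) for r in hct_rows if r[1].isdigit())
# 2453 -- the header year (2023) and the Total row (215) both leak in.
```

The point of the sketch: nothing in the flattened HCT tells the program which cells are headers, which are data, and which are pre-computed totals.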
The Problem: The "Translation" Gap
Traditionally, to ask a computer about these fancy tables, we tried to force them into a rigid spreadsheet format first. It's like trying to fit a round, artistic watercolor painting into a square, rigid plastic frame. The computer would often break the painting, lose the colors, or get confused by the layout.
Then, AI (Large Language Models) came along. These are the "super-readers" that can look at a picture of a table and understand it. But here's the catch: We didn't have a good test to see how well they actually worked. It was like giving a student a final exam without a practice test or a grading rubric. We didn't know if the AI was actually smart or just guessing.
The Solution: HCT-QA (The "Gym" for Table AI)
The authors of this paper built HCT-QA, which is essentially a massive, high-tech gym and testing ground for AI models to practice reading these tricky tables.
Here is what they built:
The Workout Equipment (The Data):
- They collected 1,880 real-world tables from places like government census reports, scientific journals, and planning councils. These are the "real" messy tables.
- They also built a Synthetic Generator (a "table factory"). This is a robot that can instantly create thousands of new fake tables with specific tricky features (like nested headers or hidden totals) to test the AI's limits.
- Total: Over 6,500 tables and nearly 80,000 questions (like "What is the average temperature in July?" or "Which country had the highest export?").
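The "table factory" idea can be sketched in a few lines. This is a hypothetical toy version, not the authors' actual generator: the key trick is that because the generator builds the table itself, it knows the correct answer to each question for free, with no human labeling needed.

```python
import random

# Toy sketch of a synthetic HCT generator (illustrative, not the
# paper's code). Tricky features -- nested headers, aggregate rows --
# are toggled on demand, and each table ships with a ground-truth
# question/answer pair derived from the values we just generated.
def make_table(n_rows=4, nested_headers=True, add_total=True, seed=None):
    rng = random.Random(seed)
    regions = [f"Region-{i}" for i in range(n_rows)]
    values = [rng.randint(10, 99) for _ in range(n_rows)]
    rows = []
    if nested_headers:
        rows.append(["Sales", ""])       # merged title cell
    rows.append(["Region", "2023"])      # header row
    rows += [[r, str(v)] for r, v in zip(regions, values)]
    if add_total:
        rows.append(["Total", str(sum(values))])
    qa = {
        "question": "What were the total sales across all regions?",
        "answer": sum(values),           # ground truth, known by construction
    }
    return rows, qa

table, qa = make_table(seed=7)
```

Because every feature is a parameter, a researcher can mass-produce, say, ten thousand tables that all have nested headers but no total rows, and probe exactly one failure mode at a time.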
The Grading System (The Benchmark):
- They didn't just ask the AI questions; they created a detailed "scorecard." They tracked why an AI got a question wrong. Was it because the table was too big? Was it because the headers were nested? Was it because the question required doing math (averaging) rather than just looking up a number?
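The scorecard idea boils down to tagging every question with the features that might trip a model up, then reporting accuracy per feature instead of one overall number. A minimal sketch, with made-up feature names and results for illustration:

```python
from collections import defaultdict

# Hypothetical scorecard (illustrative data, not the paper's results):
# each graded answer carries the set of features present in its
# table/question, so failures can be traced back to a cause.
results = [
    {"correct": True,  "features": {"nested_headers"}},
    {"correct": False, "features": {"nested_headers", "aggregation"}},
    {"correct": True,  "features": {"lookup"}},
    {"correct": False, "features": {"aggregation", "large_table"}},
]

def accuracy_by_feature(results):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        for f in r["features"]:
            totals[f] += 1
            hits[f] += r["correct"]
    return {f: hits[f] / totals[f] for f in totals}

report = accuracy_by_feature(results)
# e.g. every "aggregation" question failed, every plain "lookup" passed.
```

A breakdown like this is what lets the authors say *why* a model failed (math questions, big tables, nested headers) rather than just *how often*.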
The Experiments: Who Won the Race?
They put 34 different AI models (both text-only models and Vision-Language models that can "see" images) through the HCT-QA test.
- The Heavyweights: The biggest, most expensive AI models (like GPT-4o) performed the best, but they still made mistakes. They got about 66% of the answers right. That sounds good, but for a computer, it means failing roughly 1 out of every 3 questions.
- The Visionaries: Models that can see the table as an image (Vision-Language Models) did surprisingly well. Sometimes, looking at the picture of the table was better than reading the text code, because the AI could see the visual clues (like bold text or colors) that get lost when you convert a table to text.
- The Training Effect: The biggest surprise? When they fine-tuned (trained specifically) a smaller, cheaper AI model using their new HCT-QA data, that small model's performance jumped by 25%. It's like taking a smart student and giving them a specific textbook for a week; they suddenly ace the exam.
The Key Takeaways (In Plain English)
- Current AI isn't perfect yet: Even the best AI models struggle with complex, human-designed tables. They get confused by the layout.
- Visuals matter: Sometimes, showing the AI the picture of the table is better than giving it the text of the table. The visual layout holds clues that text misses.
- Practice makes perfect: If you train an AI specifically on these tricky tables, it gets much better. You don't always need the biggest, most expensive model; a smaller one trained on the right data can beat a giant one that hasn't seen this type of data before.
- The "Table Factory" is open source: The authors released their "table factory" (the synthetic generator) to the public. This means other researchers can now build their own training sets without having to spend years manually collecting tables.
The Analogy Summary
Think of HCTs as ancient, handwritten recipes with weird abbreviations and crossed-out ingredients.
- Old AI was like a robot that only understands typed, standard recipes. It would try to force the handwritten recipe into a standard format and end up with a burnt meal.
- HCT-QA is a cooking school where they give the robots thousands of these handwritten recipes and ask them to cook the dish.
- The Result: The robots are getting better at reading the handwriting, especially if they are trained specifically on these recipes. But they still sometimes burn the toast when the recipe is too messy!
This paper is a huge step forward because it finally gives us a way to measure how good our "cooking robots" really are and provides the tools to make them even better.