Beyond Text and Tables: Vision-Language Model… — Plain-Language Explanation

Original authors: Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge

Published 2026-06-02

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine the world of materials science as a massive, chaotic library. Inside this library are millions of books (scientific papers) containing the secrets to new materials—like stronger alloys, better batteries, or more efficient ceramics.

For a long time, computers trying to read these books had a major blind spot. They were excellent at reading the text and the tables (the spreadsheets), but they were completely illiterate when it came to the pictures. In materials science, crucial data is often hidden inside graphs and charts. If a computer couldn't "see" the graph, that data was lost, locked away in a visual format the machine couldn't understand.

This paper introduces a major upgrade to a tool called ComProScanner. Think of ComProScanner as a super-fast, tireless librarian robot. Previously, this robot could only read the words and numbers written in sentences or tables. Now, the authors have given it eyes and a brain capable of understanding images.

Here is how the new system works, broken down into simple concepts:

1. The New "Eyes" (Vision-Language Models)

The authors equipped the robot with a special type of artificial intelligence called a Vision-Language Model (VLM).

The Analogy: Imagine you are trying to teach a robot to read a map. A normal robot can read the street names (text), but it can't tell you how steep the hills are just by looking at the squiggly lines on the map. The new VLM is like a human guide who can look at the squiggly lines, understand they represent hills, and tell you exactly how high they are.
The Job: This new "eye" scans the scientific figures, reads the axes and labels, and extracts the specific numbers hidden inside the curves and bars.

2. The Smart Filter (FigureExtractor)

The library has millions of pages, and not every page has a useful graph. Scanning every single image would be a waste of time and money.

The Analogy: Before the robot starts reading every picture in the library, it has a smart assistant called the FigureExtractor. This assistant looks at the captions (the titles under the pictures) and keywords. If the caption says "Piezoelectric Coefficient," the assistant flags it as important. If it says "Author Biography," it ignores it.
The Result: The robot only spends its energy on the graphs that actually matter.

3. The "Budget" Test (Model Selection)

The authors didn't just pick the most powerful AI available; they had to be smart about cost. Using AI costs money, and the bill grows with how much material the model is asked to process.

The Analogy: Imagine you are hiring four different detectives to solve a case. You want the best detective, but you also have a strict budget. You can't hire the most expensive one if it costs a fortune.
The Result: They tested four top-tier "detectives" (AI models). They found that Gemini-3-Flash-Preview was the winner. It was the most accurate at reading the graphs and the cheapest to run. It was like finding a detective who solved the case perfectly but charged less than the others.

4. The "Fuzzy" Math (Value Error Thresholds)

Reading a number off a printed graph isn't always perfect. If a line is between 10 and 11, is it 10.4 or 10.6?

The Analogy: If you ask a human, "How tall is that building?" they might say "About 50 feet." If you demand they say "Exactly 50.000 feet," they might get it wrong because the drawing isn't precise enough.
The Innovation: The authors added a new rule to the evaluation. Instead of demanding an exact match, they allow a small, fixed amount of "wiggle room": with a tolerance of ±0.5, for example, a reading of 10.5 against a true value of 10.0 still counts as a pass. This makes the test more realistic, since reading a value off a graph always involves some estimation.

The Big Achievement

Before this paper, ComProScanner was a tool that could only read text and tables. Now, it is a fully multimodal tool.

The Metaphor: It's like upgrading a car from one that only drives on paved roads (text/tables) to an all-terrain vehicle that can drive on roads, dirt paths, and rocky hills (text, tables, and figures).

The Bottom Line:
The authors successfully built a system that can automatically find, read, and extract data from scientific graphs across many different publishers. They proved that by using the right AI model (Gemini-3-Flash-Preview) and allowing for small measurement errors, they can turn messy, visual scientific data into clean, organized digital data without a human needing to type it in manually. This is the first time such a complete, automated system has been built specifically for materials science.

Technical Summary: Vision-Language Model Integration in ComProScanner

Problem Statement
The scale and quality of materials datasets are critical for data-driven materials discovery, yet existing databases fail to capture the vast majority of experimentally measured properties found in scientific literature. While computational repositories (e.g., Materials Project, JARVIS-DFT) provide high-throughput DFT data, experimental data for functional ceramics, alloys, and polymers remain trapped in unstructured formats across millions of journal articles. Previous automated extraction frameworks, including the authors' own ComProScanner, have successfully handled textual and tabular data but have overlooked a substantial proportion of quantitative property data reported exclusively in scientific figures. Current solutions for figure extraction rely on specialized digitization tools or emerging vision-language models (VLMs), but no unified, end-to-end framework existed to extract composition-property data from figures within a single automated pipeline alongside text and tables.

Methodology
The authors extend the ComProScanner framework, a fully end-to-end multi-agent system for automated database construction, by integrating native VLM-based figure extraction capabilities. The technical implementation and its evaluation involve the following components:

Figure Filtering and Preprocessing: A FigureExtractor utility was introduced to filter relevant figures across all supported publishers based on caption keywords (e.g., piezoelectric coefficient $d_{33}$, XRD patterns). This utility handles JPEG conversion and is shared across publisher processors to reduce API costs.
Graph Extraction Agent: A GraphExtractorTool (a CrewAI BaseTool) was developed to process saved figures. Given a Digital Object Identifier (DOI), this agent reads all saved figures for an article and passes them to a configurable VLM using a structured extraction prompt. The VLM returns composition-property value pairs in the standard ComProScanner JSON schema.
Image-Aware Fallback: The DataExtractionFlow was updated to include an image-aware fallback mechanism. If the initial text-based Retrieval-Augmented Generation (RAG) fails to identify relevant data, the flow checks saved DOI figures via the VLM. If relevant graphical evidence is found, the decision is upgraded to "yes," preventing articles with graph-only data from being discarded.
Model Selection Criteria: Four VLMs were selected for evaluation based on the LMArena Diagram leaderboard (ranking human preference on diagram understanding) and a strict cost criterion of less than $1.50 per million input tokens. The selected models were Gemini-3-Flash-Preview, Gemini-2.5-Pro, GPT-5-Chat-Latest, and GPT-5.1.
Evaluation Framework: The system was benchmarked on 50 randomly selected piezoelectric ceramic articles from an established $d_{33}$ test corpus. The evaluation focused exclusively on the composition_property_values field. To address the inherent uncertainty in reading values from charts, the authors introduced a range-based value error threshold parameter (e.g., $\pm 0.5, \pm 1, \pm 2$ pC/N) rather than relying solely on exact value matching.
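
The range-based threshold described in the last item can be illustrated with a minimal Python sketch. This is not ComProScanner's evaluation code: the function names within_tolerance and count_matches, the dictionary layout, and the sample labels and values are all assumptions made for illustration.

```python
def within_tolerance(extracted: float, reference: float, tolerance: float = 0.0) -> bool:
    """True when the extracted value lies within +/- tolerance of the reference.

    tolerance=0.0 reduces to exact value matching.
    """
    return abs(extracted - reference) <= tolerance


def count_matches(extracted: dict, reference: dict, tolerance: float) -> int:
    """Count reference compositions whose extracted value passes the threshold."""
    return sum(
        1
        for composition, ref_value in reference.items()
        if composition in extracted
        and within_tolerance(extracted[composition], ref_value, tolerance)
    )


# Hypothetical d33 values in pC/N (not taken from the paper).
reference = {"sample_A": 190.0, "sample_B": 410.0}
read_from_chart = {"sample_A": 191.0, "sample_B": 413.0}

for tol in (0.0, 1.0, 2.0):
    print(tol, count_matches(read_from_chart, reference, tol))
```

In this toy data neither reading counts under exact matching. At ±1 pC/N the first reading is accepted, while the second, which is off by 3 pC/N, is still rejected at ±2 pC/N.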

Key Contributions

First Multimodal End-to-End Pipeline: The work establishes VLM-integrated ComProScanner as the first materials-specific, fully automated platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.
Novel Utility and Agent Tools: The introduction of the FigureExtractor utility for caption-based filtering and the GraphExtractorTool agent for VLM-driven data recovery.
Enhanced Evaluation Metrics: The inclusion of a range-based value error threshold parameter, providing a more physically meaningful assessment of numeric property values extracted from figures compared to strict exact matching.
Cost-Effective Model Benchmarking: A rigorous comparison of four VLMs demonstrating that high-performance models can be selected based on a balance of accuracy and input token cost.
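
The caption-based filtering attributed to the FigureExtractor utility might look roughly like the Python sketch below. The summary does not describe the utility's actual interface, so the Figure class, the function names, the keyword list, and the sample captions are hypothetical; only the idea of keeping figures whose captions mention target keywords comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Figure:
    """Hypothetical record for one figure saved from an article."""
    path: str
    caption: str


def caption_is_relevant(caption: str, keywords: list[str]) -> bool:
    """Case-insensitive check: does the caption mention any keyword?"""
    text = caption.lower()
    return any(keyword.lower() in text for keyword in keywords)


def filter_figures(figures: list[Figure], keywords: list[str]) -> list[Figure]:
    """Keep only figures worth sending to the (paid) VLM call."""
    return [fig for fig in figures if caption_is_relevant(fig.caption, keywords)]


# Illustrative keywords and captions (assumed, not from the paper).
keywords = ["piezoelectric coefficient", "d33", "XRD"]
figures = [
    Figure("fig1.jpg", "Piezoelectric coefficient d33 as a function of composition"),
    Figure("fig2.jpg", "Photograph of the sintering furnace"),
    Figure("fig3.jpg", "XRD patterns of the sintered ceramics"),
]

kept = filter_figures(figures, keywords)
print([fig.path for fig in kept])
```

Filtering on captions before any model call is what keeps cost down in this design: figures that are dropped here never generate input tokens.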

Results
Benchmarking on the 50-article subset yielded the following findings:

Performance: Gemini-3-Flash-Preview achieved the highest performance across all dimensions, with a composition accuracy of 0.97 and a normalized F1 score of 0.97. It also demonstrated the highest precision (0.96) and recall (0.95).
Comparative Performance: Gemini-2.5-Pro performed respectably with a composition accuracy of 0.86 and normalized F1 of 0.84, though it showed a lower recall relative to precision, suggesting a more conservative extraction strategy. GPT-5-Chat-Latest and GPT-5.1 performed comparably to each other but lagged significantly behind the Gemini models, with composition accuracies of 0.78 and normalized F1 scores around 0.71–0.72.
Cost-Efficiency: Gemini-3-Flash-Preview was identified as the most cost-effective model, offering the highest performance while commanding a substantially lower input cost per million tokens than its competitors.
Data Recovery: Of the 50 selected articles, 48 yielded evaluable data after extraction and cleaning. The image-aware fallback successfully prevented the silent discarding of articles containing graph-only data.
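
The decision logic of the image-aware fallback can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the DataExtractionFlow implementation: final_relevance_decision and the figure_has_data callback are hypothetical names, and the callback stands in for the VLM check on a saved figure.

```python
from typing import Callable, Sequence


def final_relevance_decision(
    text_says_relevant: bool,
    figure_paths: Sequence[str],
    figure_has_data: Callable[[str], bool],
) -> bool:
    """Combine the text-based decision with a figure-based fallback.

    figure_has_data is a placeholder for a VLM call that reports whether
    a saved figure contains the target property data.
    """
    if text_says_relevant:
        # Text-based retrieval already found relevant data; no fallback needed.
        return True
    # Fallback: upgrade the decision if any saved figure holds relevant data.
    return any(figure_has_data(path) for path in figure_paths)


def fake_vlm(path: str) -> bool:
    """Stub in place of a real VLM: pretend only one figure has usable data."""
    return path == "fig2.jpg"


print(final_relevance_decision(False, ["fig1.jpg", "fig2.jpg"], fake_vlm))
print(final_relevance_decision(False, ["fig1.jpg"], fake_vlm))
```

Because the figures are consulted only when the text-based check says "no", articles that the text already qualifies incur no extra figure calls in this sketch.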

Significance
The paper claims that these contributions establish a new standard for materials informatics by bridging the gap between published literature and machine-ready datasets for experimental data. By demonstrating that cost-effective VLMs are sufficiently capable for large-scale deployment, the authors argue that the systematic gap in existing literature mining frameworks—specifically the inability to process graphical data—has been addressed. The resulting platform enables the automated recovery of composition-property pairs from scientific charts and plots across all supported publishers, facilitating the creation of comprehensive, multimodal materials databases without human intervention. The work concludes that the integration of VLMs into the ComProScanner pipeline represents a decisive step toward fully automated, scalable materials data extraction.

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy