🌟 The Big Problem: The "Rote Memorization" Trap
Imagine you are a teacher trying to test a student's math skills. If you give them the exact same 10 math problems every single day, they might get 100% on the test. But are they actually good at math? Or did they just memorize the answers?
This is exactly what is happening with AI Agents (smart computer programs that can browse the web or read documents).
- The Old Way: Researchers use static, fixed datasets (the "same 10 problems"). The AI memorizes them, looks smart, but fails when faced with a new, real-world situation.
- The New Way (Graph2Eval): We need a way to generate infinite, unique, and solvable problems on the fly to see if the AI can actually think, not just remember.
🕸️ The Solution: The "Knowledge Graph" as a Lego Set
The authors built a system called Graph2Eval. To understand it, imagine a massive Lego set representing the entire internet and a library of documents.
The Knowledge Graph (The Lego Baseplate):
Instead of just reading text, the system breaks everything down into tiny pieces (nodes) and connects them with lines (edges).
- Example: A "Paragraph" is a brick. A "Table" is a brick. A "Link" is a connector piece.
- This creates a structured map of how information relates to other information. It's like having a map of a city where every building and street is clearly labeled and connected.
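To make the "bricks and connectors" idea concrete, here is a minimal sketch of a document stored as nodes and edges. The node IDs, types, and edge labels are made up for illustration; they are not the paper's actual schema.

```python
# A tiny document knowledge graph: nodes are content "bricks"
# (paragraphs, tables, links) and edges are the connectors.
# All names here are illustrative, not Graph2Eval's real format.
nodes = {
    "p1": {"type": "Paragraph", "text": "The 2024 budget grew by 5%."},
    "t1": {"type": "Table", "text": "Budget by quarter: Q1..Q4"},
    "l1": {"type": "Link", "text": "See the CEO's full statement"},
}

edges = [
    ("p1", "t1", "references"),    # the paragraph cites the table
    ("p1", "l1", "contains_link"), # the paragraph contains a hyperlink
]

def neighbors(node_id):
    """Every node directly connected to node_id, in either direction."""
    out = set()
    for src, dst, _rel in edges:
        if src == node_id:
            out.add(dst)
        elif dst == node_id:
            out.add(src)
    return out

print(neighbors("p1"))  # the paragraph connects to the table and the link
```

Once content is in this shape, "how does X relate to Y?" becomes a graph lookup instead of a guess.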
The Task Generator (The Architect):
The system doesn't just guess what to ask the AI. It looks at the Lego map, picks a specific cluster of bricks (a subgraph), and says: "Okay, here is a specific set of related facts. Now, build a question that requires connecting these specific bricks."
- Because the bricks are already connected logically, the question is guaranteed to make sense and have an answer.
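A toy version of that "pick a connected cluster of bricks" step might look like the sketch below. The graph layout and sampling strategy are assumptions for illustration, not the paper's actual algorithm.

```python
import random

# Adjacency list for a tiny knowledge graph (illustrative data).
graph = {
    "budget_para": ["q1_table", "ceo_link"],
    "q1_table": ["budget_para"],
    "ceo_link": ["budget_para", "ceo_quote"],
    "ceo_quote": ["ceo_link"],
}

def sample_connected_subgraph(graph, start, size, rng):
    """Grow a connected subgraph from `start` by repeatedly adding a
    neighbor of an already-selected node, so the result is never a
    pile of disconnected facts."""
    chosen = {start}
    while len(chosen) < size:
        frontier = [n for node in chosen for n in graph[node] if n not in chosen]
        if not frontier:
            break  # nothing left to attach
        chosen.add(rng.choice(frontier))
    return chosen

rng = random.Random(0)
sub = sample_connected_subgraph(graph, "budget_para", 3, rng)
# Every brick in `sub` is linked to the others, so a question built
# over this cluster is guaranteed to be answerable from connected content.
```

Because the sampler only ever adds neighbors of bricks already in the cluster, a question generated over `sub` can't secretly depend on an unrelated part of the document.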
🚀 How It Works: Two Main Scenarios
The system is designed to test agents in two different "worlds":
1. The Document Reader (RAG Agents)
- The Scenario: You give the AI a stack of PDFs and ask, "What does the CEO say about the budget in the 2024 report?"
- The Graph2Eval Magic: It treats the document like a 3D puzzle. It finds the "Budget" piece and the "CEO" piece in its Lego map, checks they are connected, and generates a question that forces the AI to find that specific connection.
- Why it's better: Old methods might ask a question where the answer doesn't exist in the text (hallucination). Graph2Eval ensures the answer is right there in the Lego structure.
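That grounding guarantee can be sketched as a simple acceptance check: before a question/answer pair is emitted, verify the answer literally appears on the node the question is anchored to. The node names and check are hypothetical, for illustration only.

```python
# Nodes of a parsed PDF (illustrative, not the paper's real schema).
doc_nodes = {
    "ceo_quote": "The CEO says the 2024 budget will grow by 5%.",
    "footer": "Annual Report 2024, page 12.",
}

def is_grounded(question_node, answer_text):
    """Only accept a Q/A pair whose answer literally appears on the
    anchor node -- so the agent is never asked about something that
    isn't in the text."""
    return answer_text in doc_nodes.get(question_node, "")

print(is_grounded("ceo_quote", "grow by 5%"))     # True: answer is in the text
print(is_grounded("ceo_quote", "shrink by 2%"))   # False: would be a fake task
```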
2. The Web Surfer (Web Agents)
- The Scenario: You tell the AI, "Go to the weather site, find the forecast for Tokyo, and click the 'Save' button."
- The Graph2Eval Magic: It maps the website like a subway system. It knows that the "Search Bar" is connected to the "Results Page," which is connected to the "Save Button."
- The Seed Strategy: It picks a starting point (a "seed," like a search bar) and builds a path forward. It ensures the AI isn't asked to click a button that doesn't exist or navigate to a page that isn't linked.
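The seed strategy amounts to path-finding on the website graph: start at the seed element and search outward until you hit the goal. Here is a minimal breadth-first sketch; the site layout and element names are invented for illustration.

```python
from collections import deque

# A website modeled as a "subway map": pages/elements are stations,
# clicks/navigations are the lines between them (illustrative layout).
site = {
    "home": ["search_bar"],
    "search_bar": ["results_page"],
    "results_page": ["tokyo_forecast"],
    "tokyo_forecast": ["save_button"],
    "save_button": [],
}

def path_from_seed(site, seed, goal):
    """BFS from the seed element; returns a click path to the goal,
    or None if the goal is unreachable -- so we never generate an
    impossible "click a button that doesn't exist" task."""
    queue = deque([[seed]])
    seen = {seed}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in site[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(path_from_seed(site, "search_bar", "save_button"))
# ['search_bar', 'results_page', 'tokyo_forecast', 'save_button']
```

If `path_from_seed` returns None, the candidate task is simply discarded before any agent ever sees it.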
🛡️ The Quality Control: The "Safety Inspector"
Just because you can build a Lego tower doesn't mean it's a stable house. The system has a Multi-Stage Filter:
- Reachability Check: Can the AI actually get from Point A to Point B? (Is the path open?)
- Solvability Check: Is there enough information to solve the puzzle?
- Uniqueness Check: Is this question too similar to one we already asked? (We want variety, not repetition.)
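The three checks above can be sketched as a simple pipeline. The task fields, the word-overlap heuristic, and the 0.8 threshold are assumptions chosen for illustration, not the paper's actual filter.

```python
def reachable(graph, start, goal):
    """Reachability check: is there any path from start to goal?"""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def solvable(task):
    """Solvability check (toy): the task must carry at least one
    evidence node, i.e. enough information to answer it."""
    return len(task["evidence"]) >= 1

def is_unique(task, accepted):
    """Uniqueness check (toy): reject near-duplicates by word overlap."""
    words = set(task["question"].split())
    for prev in accepted:
        prev_words = set(prev["question"].split())
        overlap = len(words & prev_words) / max(len(words | prev_words), 1)
        if overlap > 0.8:
            return False
    return True

def multi_stage_filter(tasks, graph):
    """Keep only tasks that pass all three inspections, in order."""
    accepted = []
    for task in tasks:
        if (reachable(graph, task["start"], task["goal"])
                and solvable(task)
                and is_unique(task, accepted)):
            accepted.append(task)
    return accepted
```

Running candidates through the stages in order means cheap structural checks (reachability) weed out bad tasks before the more expensive comparisons (uniqueness) run.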
📊 The Results: The "Graph2Eval-Bench"
The team used this system to build a new test suite called Graph2Eval-Bench, containing 1,319 unique tasks.
- The Test: They ran various AI models (like GPT-4o, Qwen, DeepSeek) through this new test.
- The Outcome:
- Better Consistency: The tasks were 20% more coherent than tasks generated by other methods.
- Better Solvability: The tasks were 17% more solvable, because every task came with a valid path guaranteed by construction.
- True Differentiation: The test successfully separated the "smart" AIs from the "dumb" ones. Some models that looked good on old tests failed miserably here, proving they were just memorizing, not reasoning.
🏁 The Takeaway
Graph2Eval is like moving from a multiple-choice quiz (where you can guess) to a live escape room (where you have to actually solve puzzles to survive).
By using a Knowledge Graph as the blueprint, the system creates a safe, structured, and infinite playground for testing AI agents. It stops us from tricking the AI with fake questions and starts us on the path to seeing what these digital minds can really do in the real world.