VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Imagine you have a very smart, well-read robot assistant. You can show it a picture, and it can tell you what's in it, like a bird or a car. That's the "old" way these AI models worked.

But now, we want these robots to be doers, not just lookers. We want them to take a messy photo, fix it, measure things, count objects, and solve puzzles using a toolbox of digital instruments.

This paper introduces VTC-Bench, which is essentially a gym for these robot assistants. It's a rigorous test designed to see if they can actually use their tools effectively, or if they just pretend to know what they're doing.

Here is the breakdown in simple terms:

1. The Problem: The "Toolbox" is Too Easy

Imagine you give a chef a kitchen.

Old Tests: The chef only had to chop one carrot and stir one pot. If they did that, we said, "Great chef!"
Real Life: A real chef might need to peel a potato, chop an onion, sauté it, season it, and then plate it. They need to chain these actions together perfectly.

The paper says current AI benchmarks are like the "one carrot" test. They are too simple. They don't test if the AI can handle a complex recipe where it needs to use 5 or 6 different tools in a specific order to get the right answer.

2. The Solution: The "OpenCV" Gym

The researchers built VTC-Bench (VisualToolChain-Bench).

The Toolkit: They gave the AI a massive digital toolbox containing 32 different tools (like a digital Swiss Army Knife). These tools can rotate images, brighten them, cut out shapes, count pixels, or find edges.
The Workout: They created 680 tricky challenges. These aren't just "What color is the car?" questions. They are things like:
- "This photo is blurry and upside down. Fix it, then count how many red cars are in the parking lot."
- "This chart is hard to read. Clean up the contrast, measure the bars, and tell me which one is the biggest."
The Scoring: They don't just check the final answer. They check the recipe. Did the AI use the right tools in the right order? Did it waste time using a hammer to crack a nut?

3. The Results: The Robots Are Still Learning to Cook

The researchers tested 19 of the smartest AI models (including big names like GPT-4o, Gemini, and Qwen) in this gym. The results were a bit of a reality check:

The "Smart" Ones Struggle: Even the best-performing model got only about 51% of the questions right, and every other model scored lower. That's barely passing a high school exam.
The "Tool Illusion": Many models act like they are using tools, but they often pick the wrong ones. It's like a chef trying to boil water with a blender. They might say, "I'm going to use the 'Zoom In' tool," but then they forget to actually zoom in before trying to measure something.
The "Shortcut" Habit: When things get hard, the models tend to give up on the complex plan. Instead of using 5 tools to solve a puzzle, they try to guess the answer using just 1 or 2 familiar tools they know well. They get stuck in a rut.
Closed vs. Open: The "closed" models (like those from Google and OpenAI) did a bit better than the "open" ones (community-built models), but even the best ones struggled with long, complex chains of actions.

4. Why This Matters

Think of this like teaching a child to drive.

Old Way: We let them drive in an empty parking lot at 5 mph. They passed!
VTC-Bench Way: We put them on a busy highway with rain, construction, and merging traffic.

The paper shows that while our AI "drivers" are getting better at looking at the road, they are still terrible at navigating complex traffic. They can't yet plan a long journey involving multiple turns, stops, and tool uses without getting confused.

The Takeaway

VTC-Bench is a wake-up call. It tells us that to make AI truly useful in the real world (like fixing photos, analyzing medical scans, or helping engineers), we can't just make the AI "smarter" in general. We have to teach it how to plan, how to chain tools together, and how to recognize when the first approach has failed and switch to a different one.

Until the models can pass this "gym" test, they are still more like tourists looking at a map than explorers actually navigating the terrain.

1. Problem Statement

Multimodal Large Language Models (MLLMs) have evolved from passive visual question answering (VQA) systems to active "agentic" models capable of utilizing external tools. However, current benchmarks fail to adequately evaluate these models' ability to compose diverse tools and execute long-horizon, multi-step plans required for complex real-world visual tasks.

Limitations of Existing Benchmarks: Most existing benchmarks rely on sparse tool sets (often <10 tools) and simple, single-step invocations. They do not capture the complexity of chaining multiple distinct operations (e.g., preprocessing, feature extraction, and measurement) to solve a single problem.
The Gap: There is a critical lack of evaluation frameworks that test a model's ability to adapt to a diverse array of tools, generalize to unseen operations, and formulate efficient execution plans without relying on a narrow subset of familiar functions.

2. Methodology: VTC-Bench Framework

The authors introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to rigorously assess the tool-use proficiency of MLLMs.

A. Tool Set and Architecture

Tool Source: The benchmark utilizes 32 distinct visual operations derived from the OpenCV library, chosen to mimic authentic computer vision pipelines.
Categorization: The tools are organized into four functional modules:
1. Geometry: Spatial transformations (e.g., Rotate, Crop, Flip, Pyramid).
2. Enhancement: Signal optimization (e.g., Color conversion, Binarization, Histogram Equalization, Denoising).
3. Feature Extraction: Deriving structural/semantic primitives (e.g., Edge detection, Watershed, Connected Components, Hough transforms).
4. Drawing: Reasoning verification and quantification (e.g., Contour visualization, Area/Perimeter measurement).
Interaction Paradigms: Models can interact via Code Execution (writing Python/OpenCV scripts) or Interface-Driven calls (invoking predefined atomic functions).
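To make the Interface-Driven paradigm concrete, here is a minimal sketch of a tool registry and a chain executor. It is an illustration only, not the benchmark's implementation: the tool names (`flip_horizontal`, `binarize`, `count_components`), the `run_toolchain` helper, and the nested-list image representation are all assumptions chosen so the example runs with the Python standard library alone. In VTC-Bench the corresponding operations come from OpenCV.

```python
from typing import Any, Callable, Dict, List, Tuple

# Grayscale image as rows of 0-255 ints (a stand-in for an OpenCV/NumPy array).
Image = List[List[int]]


def flip_horizontal(img: Image) -> Image:
    """Geometry-style tool: mirror each row left to right."""
    return [row[::-1] for row in img]


def binarize(img: Image, threshold: int = 128) -> Image:
    """Enhancement-style tool: map each pixel to 0 or 255 with a fixed threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def count_components(img: Image) -> int:
    """Feature-extraction-style tool: count 4-connected non-zero regions."""
    height = len(img)
    width = len(img[0]) if img else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if img[y][x] == 0 or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            stack = [(y, x)]
            while stack:  # iterative flood fill over one region
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width:
                        if img[ny][nx] != 0 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


# Hypothetical registry: the model may only call tools by these names.
TOOLS: Dict[str, Callable[..., Any]] = {
    "flip_horizontal": flip_horizontal,
    "binarize": binarize,
    "count_components": count_components,
}


def run_toolchain(img: Image, chain: List[Tuple[str, Dict[str, Any]]]) -> Any:
    """Apply each (tool_name, kwargs) step to the output of the previous step."""
    result: Any = img
    for name, kwargs in chain:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        result = TOOLS[name](result, **kwargs)
    return result


image = [
    [200, 200, 10, 10, 10],
    [200, 200, 10, 10, 180],
    [10, 10, 10, 10, 180],
    [10, 150, 10, 10, 10],
]
chain = [
    ("flip_horizontal", {}),
    ("binarize", {"threshold": 128}),
    ("count_components", {}),
]
n_regions = run_toolchain(image, chain)
print(n_regions)  # 3 bright regions in this toy image
```

In the Code Execution paradigm the model would instead write the equivalent script itself, so the same three steps would appear as direct library calls rather than as named entries in a registry.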

B. Task Taxonomy and Dataset

Dataset Size: 680 curated Visual Question Answering (VQA) problems.
Cognitive Hierarchy: Tasks are structured into a 9-category, 3-tier cognitive hierarchy to test progressive reasoning:
- Tier 1 (Visual Perception Enhancement): Robust OCR, Perceptual Restoration (haze/low-light), and Attention Focusing (geometric distortions).
- Tier 2 (Quantitative Visual Estimation): Measurement, Color analysis, and Counting (requiring precise tool selection rather than intrinsic estimation).
- Tier 3 (Compositional Visual Reasoning): Chart analysis, Math problems, and Spatial Reasoning (requiring complex logical deduction and multi-step orchestration).
Ground Truth: Every problem includes a ground-truth execution trajectory (a reference toolchain) to enable precise evaluation of intermediate planning, not just final answers.
Statistics: The average toolchain length is 5.04 steps, with a median of 5 and a maximum of 10, involving an average of 4.97 unique tools per task.
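The statistics above can be reproduced from the reference toolchains with a few lines of code. The sketch below assumes a hypothetical `Task` record (the field names are not taken from the paper) and uses three made-up tasks, so its numbers are illustrative and do not match the benchmark's 5.04 / 5 / 10 / 4.97.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions for illustration."""
    question: str
    tier: int                   # 1, 2, or 3 in the cognitive hierarchy
    category: str               # one of the 9 task categories
    reference_chain: List[str]  # ground-truth tool sequence


def chain_statistics(tasks: List[Task]) -> Dict[str, float]:
    """Summarize reference toolchain length and tool diversity across tasks."""
    lengths = [len(t.reference_chain) for t in tasks]
    unique_tools = [len(set(t.reference_chain)) for t in tasks]
    return {
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
        "mean_unique_tools": mean(unique_tools),
    }


# Toy data with invented tool names.
tasks = [
    Task("Read the rotated sign.", 1, "Robust OCR",
         ["rotate", "denoise", "threshold"]),
    Task("Count the coins.", 2, "Counting",
         ["denoise", "threshold", "denoise", "connected_components", "draw_contours"]),
    Task("Which bar is tallest?", 3, "Chart analysis",
         ["crop", "threshold", "find_contours", "measure_area"]),
]
stats = chain_statistics(tasks)
print(stats)
```

A mean unique-tool count close to the mean chain length, as in the reported 4.97 versus 5.04, indicates that reference chains rarely repeat a tool, so most steps require selecting a different operation.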

C. Evaluation Metrics

The paper proposes several metrics beyond simple accuracy:

Average Pass Rate (APR): The fraction of tasks answered correctly (standard accuracy).
Tool Call Rate (TCR): Proportion of tasks where at least one tool is invoked.
Mean Absolute Error (MAE): The absolute difference between the predicted toolchain length and the ground-truth toolchain length, averaged over tasks.
Tool Usage Efficiency ($Eff_{tool}$): Ratio of effective steps to total predicted steps, measuring conciseness and redundancy.
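As a rough illustration of how these four metrics could be computed from logged runs, consider the sketch below. Two points are assumptions rather than the paper's definitions: an "effective step" is taken to be a predicted tool call that can be matched one-to-one with a call to the same tool in the reference chain (order is ignored), and $Eff_{tool}$ is aggregated over all predicted steps rather than averaged per task. The `Run` record and the tool names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One model attempt at one task (hypothetical record layout)."""
    correct: bool
    predicted_chain: List[str]
    reference_chain: List[str]


def effective_steps(predicted: List[str], reference: List[str]) -> int:
    """ASSUMPTION: a step is effective if it matches an unmatched reference tool."""
    overlap = Counter(predicted) & Counter(reference)  # multiset intersection
    return sum(overlap.values())


def evaluate(runs: List[Run]) -> Dict[str, float]:
    """Compute APR, TCR, MAE, and Eff_tool over a list of runs."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    total_predicted = sum(len(r.predicted_chain) for r in runs)
    total_effective = sum(
        effective_steps(r.predicted_chain, r.reference_chain) for r in runs
    )
    return {
        "APR": sum(1 for r in runs if r.correct) / n,
        "TCR": sum(1 for r in runs if r.predicted_chain) / n,
        "MAE": sum(abs(len(r.predicted_chain) - len(r.reference_chain)) for r in runs) / n,
        "Eff_tool": total_effective / total_predicted if total_predicted else 0.0,
    }


runs = [
    Run(True, ["rotate", "threshold", "count"], ["rotate", "threshold", "count"]),
    Run(False, ["crop", "crop", "crop", "crop"], ["denoise", "crop", "edges"]),
    Run(False, [], ["rotate", "count"]),
    Run(True, ["rotate"], ["rotate", "count"]),
]
metrics = evaluate(runs)
print(metrics)
# {'APR': 0.5, 'TCR': 0.75, 'MAE': 1.0, 'Eff_tool': 0.625}
```

The second run shows why efficiency is reported separately from accuracy: four calls to the same tool count as only one effective step under this definition, which lowers $Eff_{tool}$ even though the chain length is close to the reference.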

3. Key Contributions

First Large-Scale Compositional Benchmark: VTC-Bench is the first benchmark to systematically evaluate multi-tool composition with a rich set of 32 OpenCV-based tools, moving beyond simple single-step tool use.
Hierarchical Cognitive Design: The 3-tier taxonomy (Perception $\to$ Estimation $\to$ Reasoning) provides a granular view of model capabilities, distinguishing between basic image recovery and complex logical deduction.
Rigorous Verification Protocol: The dataset construction involves a multi-stage verification process (human annotation + LLM validation + expert cross-check) to ensure high-quality ground-truth trajectories.
Dual-Paradigm Evaluation: It evaluates models under both Code-driven and Interface-driven settings, offering insights into how programming proficiency affects orchestration.

4. Experimental Results

The authors evaluated 19 leading MLLMs, including proprietary (GPT-o3, GPT-4o, Gemini-3.0-Pro) and open-source models (Qwen3-VL, DeepEyes, Thyme).

Overall Performance: Performance is generally low. Even the best-performing model, Gemini-3.0-Pro, achieved only 51.18% (Code setting) and 51.03% (Interface setting).
Proprietary vs. Open-Source:
- Proprietary models show significant gains when augmented with tools (e.g., GPT-4o improved by +9.56% in interface mode).
- Open-source models often show minimal gains or even performance degradation, indicating a gap in native tool-use capabilities.
Key Findings on Limitations:
- Tool Utilization Bias: Models heavily rely on a narrow subset of familiar tools (e.g., Zoom In, Crop, Rotate) and struggle to generalize to complex or less common operations.
- Inefficient Planning: Models frequently generate suboptimal, redundant toolchains. For instance, GPT-5.2 had a Tool Usage Efficiency of only 16.78%, meaning most of its tool calls were ineffective or redundant compared to the ground truth.
- Failure Modes:
  1. Strategic Misselection: Choosing inappropriate tools (e.g., drawing lines instead of measuring) based on flawed intrinsic perception.
  2. Over-reliance on Intermediate Outputs: Blindly accepting tool outputs without cross-verifying against the original image, leading to error propagation.
- Prompt Sensitivity: Providing ground-truth tool hints improved performance slightly but did not solve the fundamental bottleneck of reasoning logic, suggesting models struggle to synthesize the correct multi-step execution flow even with oracle knowledge.
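One simple way to quantify the tool utilization bias described above is to measure what share of all tool calls goes to the k most frequently used tools. The sketch below is an assumed diagnostic, not a metric defined in the paper, and the chains and tool names are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple


def tool_usage_share(chains: List[List[str]], k: int = 3) -> Tuple[List[Tuple[str, int]], float]:
    """Return the k most-called tools and the fraction of all calls they account for."""
    counts = Counter(tool for chain in chains for tool in chain)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(k)
    share = sum(c for _, c in top) / total
    return top, share


# Toy predicted chains from a model that leans on a couple of familiar tools.
predicted_chains = [
    ["crop", "zoom_in", "crop"],
    ["rotate", "crop"],
    ["zoom_in", "threshold", "crop"],
    ["crop", "zoom_in"],
]
top_tools, share = tool_usage_share(predicted_chains, k=2)
print(top_tools, share)  # [('crop', 5), ('zoom_in', 3)] 0.8
```

Comparing this share for a model's predicted chains against the same share for the reference chains would indicate whether the model's tool use is narrower than the tasks require.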

5. Significance and Future Directions

Benchmarking Standard: VTC-Bench establishes a rigorous baseline that exposes the "illusion of competence" in current agentic models, highlighting that high VQA scores do not translate to effective tool orchestration.
Research Guidance: The results indicate that future MLLM development must focus on:
- Improving long-horizon planning and compositional reasoning.
- Enhancing tool diversity and the ability to select optimal tools for unseen scenarios.
- Bridging the gap between intrinsic perception and active tool execution.
Real-World Applicability: By simulating authentic computer vision pipelines, VTC-Bench provides a more realistic testbed for deploying MLLMs in industrial and scientific applications where complex, multi-step visual processing is required.

In conclusion, VTC-Bench reveals that while MLLMs are becoming better at "seeing," they remain significantly challenged in "acting" via complex, compositional tool chains. The benchmark serves as a critical diagnostic tool to guide the next generation of truly generalized visual agents.