TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Imagine you have a very smart robot assistant. This robot has excellent eyes (cameras) and a very chatty brain (a large language model). If you ask it, "Is that cup to the left of the plate?" it can usually answer correctly. It's great at qualitative reasoning: understanding relationships like "left," "right," "above," or "near."

But if you ask it, "Move the cup exactly 5 centimeters to the right," the robot gets confused. It might guess, "Maybe 3 centimeters? Or maybe 10?" It lacks the internal calculator to do precise math. It's like a person who can tell you a mountain is "tall" but can't tell you it's exactly 3,452 meters high.

This is the problem TIGeR solves.

The Problem: The "Guessing Game"

Current robots rely on AI models that are great at recognizing patterns but terrible at math. They try to "guess" the answer based on what they've seen before, similar to how a student might guess the answer to a math problem because it looks like one they saw on a test. In the real world, where a robot needs to pick up a fragile egg or pour a drink without spilling, guessing is dangerous. You need centimeter-level precision.

The Solution: TIGeR (The Robot with a Calculator)

The authors created a framework called TIGeR (Tool-Integrated Geometric Reasoning).

Think of TIGeR not as a robot trying to memorize math formulas, but as a smart manager who knows when to call an expert.

The Manager (The AI Brain): When the robot sees a task like "Pour water from 5cm above the plant," the AI brain doesn't try to calculate the distance itself. Instead, it says, "I know I need to do some geometry here. I need to call the calculator."
The Experts (The Tools): The AI writes a tiny piece of computer code (like a recipe) and sends it to a specialized "calculator" tool. This tool uses real data from the camera (like depth sensors and lens settings) to do the exact math.
The Result: The calculator returns a precise number (e.g., "The point is at coordinates X, Y, Z"). The AI then tells the robot arm to move exactly there.
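For readers curious about what the "calculator" might actually compute, here is a minimal sketch of one common building block: turning a pixel plus a depth reading into a 3D point with the standard pinhole camera model. The function name and the numeric values are hypothetical and chosen only for illustration; the paper's actual tools may be organized differently.

```python
from typing import Tuple


def pixel_to_camera_point(
    u: float, v: float, depth_m: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with metric depth into camera coordinates.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point:
    the "lens settings" (camera intrinsics) mentioned in the text.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics and depth reading, for illustration only.
point = pixel_to_camera_point(
    u=400.0, v=300.0, depth_m=0.8, fx=600.0, fy=600.0, cx=320.0, cy=240.0
)
print(point)
```

The point of the sketch is that nothing here is guessed: given calibrated inputs, the 3D coordinates follow from arithmetic.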

How They Taught the Robot

You can't just tell a robot to "be smart." You have to train it. The researchers built a massive training library called TIGeR-300K.

The Textbook: Imagine a textbook with 300,000 practice problems. But these aren't just questions and answers. Every problem includes the step-by-step solution, the calculator code used, and the intermediate steps.
The Training Method:
- Stage 1 (Supervised Learning): They showed the robot the textbook, teaching it, "When you see this type of question, write this specific code to get the answer."
- Stage 2 (Reinforcement Learning): They played a game of "Red Light, Green Light." If the robot wrote code that got the right answer, it got a gold star. If it wrote code that was messy or got the wrong number, it got a gentle correction. They even gave extra points for writing clean, logical code, not just lucky guesses.

What Can It Do Now?

Because TIGeR uses a "calculator" instead of a "guess," it can do things other robots can't:

The "Back of the Object" Trick: If you ask a normal robot to put a bag "behind" a toy, it might get stuck because it can't see the back of the toy (it's hidden). TIGeR calculates the 3D shape of the toy, figures out where the "back" is in 3D space even if it's invisible, and guides the robot there.
The "Exact Distance" Trick: It can move an object to be exactly 10cm away from another, not "kind of close."
The "Multi-View" Trick: If you show it two pictures taken from different angles, it can mathematically combine them to understand the 3D distance between objects, just like a human using two eyes to judge depth, but with mathematical precision.
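The three tricks above each reduce to a few lines of vector math once 3D coordinates are available. The sketch below is illustrative only: the function names are hypothetical, the camera is assumed to sit at the origin of its own frame, "behind" is assumed to mean "farther along the viewing ray," and the rotation and translation between the two views are assumed to be known from calibration.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _unit(v: Vec3) -> Vec3:
    """Return v scaled to length 1."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def place_at_offset(anchor: Vec3, direction: Vec3, distance_m: float) -> Vec3:
    """'Exact distance': the point exactly distance_m from anchor along direction."""
    d = _unit(direction)
    return (
        anchor[0] + distance_m * d[0],
        anchor[1] + distance_m * d[1],
        anchor[2] + distance_m * d[2],
    )


def point_behind(box_center: Vec3, half_depth_m: float, margin_m: float) -> Vec3:
    """'Back of the object': step past the far side of a 3D box.

    Assumes the camera is at the origin, so the viewing ray through the
    object has the same direction as box_center itself.
    """
    return place_at_offset(box_center, box_center, half_depth_m + margin_m)


def cross_view_distance(
    p_in_a: Vec3,
    p_in_b: Vec3,
    rot_b_to_a: Sequence[Sequence[float]],
    trans_b_to_a: Vec3,
) -> float:
    """'Multi-view': move a point from camera B's frame into A's, then measure."""
    moved = tuple(
        sum(rot_b_to_a[i][j] * p_in_b[j] for j in range(3)) + trans_b_to_a[i]
        for i in range(3)
    )
    return math.dist(p_in_a, moved)


# Hypothetical example: a target 10 cm to the right of a cup at (0, 0, 1).
target = place_at_offset((0.0, 0.0, 1.0), (1.0, 0.0, 0.0), 0.10)
print(target)
```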

The Bottom Line

Before TIGeR, robots were like artists who could draw a beautiful picture of a table but couldn't measure the table to build a chair that fits.

With TIGeR, the robot is now like an architect. It still has the artistic vision to understand the scene, but it also carries a tape measure and a calculator. It doesn't just "see" the world; it computes the world, allowing it to perform delicate, precise tasks in the real world with centimeter-level accuracy.

Technical Summary

1. Problem Statement

Current Vision-Language Models (VLMs) excel at qualitative spatial reasoning (e.g., identifying "left of" or "reachable") but fail at the quantitative geometric reasoning required for real-world robotics.

Limitations: Existing VLMs treat geometric problems as pattern recognition tasks, lacking the computational machinery for metric precision (e.g., centimeter-level accuracy). They often discard rich metric information from depth sensors and camera calibration, reducing 3D data to 2D image-like representations.
Consequence: This prevents embodied agents from performing precise tasks like pose estimation, collision-free trajectory planning, or metric-based manipulation (e.g., "place the object 5cm above the plant").
Gap: Current approaches either rely on statistical regression (which is probabilistic and prone to hallucination) or lack the ability to utilize external geometric libraries for exact calculations.

2. Methodology: TIGeR Framework

The authors propose TIGeR (Tool-Integrated Geometric Reasoning), a framework that transforms VLMs from perceptual estimators into geometric computers. Instead of internalizing complex geometry within neural weights, TIGeR enables models to recognize reasoning needs, synthesize code, and invoke external tools.

A. Core Architecture

TIGeR operates on calibrated metric inputs (depth, camera intrinsics/extrinsics) and follows a hierarchical workflow:

Visual Perception Tools: Extract sensory data (e.g., 2D bounding boxes, segmentation masks via SAM2, depth maps via MoGe-2).
Geometric Computation Tools: Convert 2D data to 3D, perform transformations, and calculate metrics.
- Key Mechanism: The VLM generates Python code to execute these computations using a sandboxed code executor (e.g., calculating distances between 3D bounding boxes or transforming coordinates).
Tool Categories:
- Perception: Camera intrinsics/extrinsics, depth sensors, object segmentation.
- Computation: 2D-to-3D box conversion, 3D-to-2D projection, code execution for arbitrary geometric logic.
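As a rough illustration of the "generate code, then execute it with tools" mechanism, the sketch below runs a model-written snippet inside a restricted namespace that exposes only a few geometric helpers. Everything here is an assumption made for the example: the helper names, the `result` variable convention, and the use of `exec` with emptied builtins. Emptying builtins is not a real security boundary; a production sandbox would need process-level isolation, and the paper's executor may work quite differently.

```python
import math
from typing import Any, Dict, Tuple

Vec3 = Tuple[float, float, float]


def project_3d_to_2d(point: Vec3, fx: float, fy: float, cx: float, cy: float):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def box_center(box_min: Vec3, box_max: Vec3) -> Vec3:
    """Center of an axis-aligned 3D bounding box."""
    return tuple((lo + hi) / 2.0 for lo, hi in zip(box_min, box_max))


# Hypothetical tool registry visible to generated code.
TOOLS: Dict[str, Any] = {
    "math": math,
    "project_3d_to_2d": project_3d_to_2d,
    "box_center": box_center,
}


def run_generated_code(code: str, inputs: Dict[str, Any]) -> Any:
    """Execute model-written code with only the tools and inputs in scope.

    By convention (an assumption of this sketch) the code stores its
    answer in a variable named `result`.
    """
    namespace: Dict[str, Any] = {"__builtins__": {}, **TOOLS, **inputs}
    exec(code, namespace)
    return namespace.get("result")


# A snippet of the kind a VLM might emit for "how far is the cup from the plate?"
generated = """
c1 = box_center(cup_min, cup_max)
c2 = box_center(plate_min, plate_max)
result = math.dist(c1, c2)
"""

distance_m = run_generated_code(
    generated,
    {
        "cup_min": (0.0, 0.0, 1.0), "cup_max": (0.1, 0.1, 1.1),
        "plate_min": (0.3, 0.0, 1.0), "plate_max": (0.5, 0.2, 1.1),
    },
)
print(round(distance_m, 4))
```

Keeping the tools in a registry reflects the adaptability argument: adding a new helper means adding an entry, not retraining the executor.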

B. Dataset: TIGeR-300K

To train this paradigm, the authors constructed TIGeR-300K, a dataset of roughly 300,000 samples (the two sources listed below total about 309K) covering point transformations, pose estimation, trajectory generation, and spatial compatibility.

Generation Strategy: A hybrid approach combining:
1. Template-Based Synthesis: Using the CA-1M dataset with structured templates to generate precise, ground-truth geometric queries (274K samples).
2. LLM-Driven Rewriting: Using large models to rewrite existing Chain-of-Thought (CoT) data (SSR-CoT) into Tool-Integrated Reasoning (TIR) formats, inserting tool calls where precision is needed (35K samples).
Content: Each sample includes the problem statement, solution, complete tool invocation sequences, and intermediate computational steps.
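To make the sample structure and the template-based route concrete, here is a minimal sketch of how one such record could be generated. The field names, the question template, and the tool-call format are all hypothetical; the released dataset's schema is not described in this summary and may differ.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ToolIntegratedSample:
    """One training record: question, tool trace, intermediate steps, answer."""
    problem: str
    tool_calls: List[Dict[str, Any]]
    intermediate_steps: List[str]
    solution: str


# Hypothetical question template; ground truth comes from known 3D annotations.
TEMPLATE = "What is the distance in meters between the {a} and the {b}?"


def make_distance_sample(
    name_a: str, center_a: Vec3, name_b: str, center_b: Vec3
) -> ToolIntegratedSample:
    """Fill the template and compute the exact answer from annotated centers."""
    distance = math.dist(center_a, center_b)
    code = "result = math.dist(center_a, center_b)"
    return ToolIntegratedSample(
        problem=TEMPLATE.format(a=name_a, b=name_b),
        tool_calls=[{
            "tool": "code_executor",
            "code": code,
            "inputs": {"center_a": center_a, "center_b": center_b},
        }],
        intermediate_steps=[
            f"{name_a} center: {center_a}",
            f"{name_b} center: {center_b}",
            f"Euclidean distance: {distance:.4f} m",
        ],
        solution=f"{distance:.2f} m",
    )


sample = make_distance_sample("cup", (0.0, 0.0, 1.0), "plate", (0.3, 0.4, 1.0))
print(sample.problem)
print(sample.solution)
```

Because the answer is computed from annotations rather than written by hand, every generated sample carries exact ground truth, which is the appeal of the template-based route.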

C. Training Pipeline

The authors employ a two-stage training pipeline on the GLM-4.1V-Thinking base model:

Supervised Fine-Tuning (SFT): Trains the model on TIGeR-300K to learn the syntax of tool invocation and the structure of reasoning chains.
Reinforcement Fine-Tuning (RFT): Uses GRPO (Group Relative Policy Optimization) with a novel Hierarchical Reward Design to refine accuracy. The reward function includes five components:
- Format Reward: Ensures valid syntax for spatial tokens and tools.
- Tool Invocation Reward: Validates correct tool selection and parameter formatting.
- Parameter Content Reward: Penalizes errors in continuous (e.g., coordinates) and discrete parameters.
- Code Generation Reward: Checks if code executes and produces correct outputs.
- Answer Reward: Evaluates the final result against ground truth.
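The five reward components and the group-relative step of GRPO can be sketched as follows. The weights, the gating order (later rewards only count once earlier checks pass), and the use of population standard deviation are assumptions made for illustration; the summary does not state how the paper combines the terms.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class RolloutChecks:
    """Outcome of each check for one sampled response."""
    format_ok: bool
    tool_ok: bool
    param_error: float  # normalized parameter error in [0, 1]; 0 means exact
    code_ok: bool
    answer_ok: bool


# Hypothetical weights that sum to 1.0.
W_FORMAT, W_TOOL, W_PARAM, W_CODE, W_ANSWER = 0.1, 0.2, 0.2, 0.2, 0.3


def hierarchical_reward(c: RolloutChecks) -> float:
    """Combine the five components, gating later ones on earlier ones."""
    if not c.format_ok:
        return 0.0  # malformed output earns nothing
    reward = W_FORMAT
    if not c.tool_ok:
        return reward  # wrong tool: skip parameter, code, and answer checks
    reward += W_TOOL
    clamped = min(max(c.param_error, 0.0), 1.0)
    reward += W_PARAM * (1.0 - clamped)
    if c.code_ok:
        reward += W_CODE
    if c.answer_ok:
        reward += W_ANSWER
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each reward standardized within its sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to the same prompt.
group = [
    RolloutChecks(True, True, 0.0, True, True),     # fully correct
    RolloutChecks(True, True, 0.5, True, False),    # runs, wrong answer
    RolloutChecks(True, False, 1.0, False, False),  # wrong tool
    RolloutChecks(False, False, 1.0, False, False), # malformed
]
rewards = [hierarchical_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)
```

The graded scores give partially correct responses (right tool, wrong number) more credit than malformed ones, which is the kind of intermediate signal a single answer-only reward would not provide.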

3. Key Contributions

Concept & Method: Introduced TIGeR, a paradigm shift from "learning geometry" to "computing geometry" via code generation and tool integration, enabling centimeter-level precision.
Dataset: Released TIGeR-300K, the first large-scale dataset explicitly designed for programmatic tool invocation in geometric reasoning.
Training Strategy: Proposed a two-stage SFT → RFT pipeline with a hierarchical reward system tailored for Tool-Integrated Reasoning (TIR), addressing the lack of intermediate precision in standard RL methods.

4. Experimental Results

The paper validates TIGeR across benchmarks, simulations, and real-world robotics.

Spatial Reasoning Benchmarks:
- Achieved State-of-the-Art (SOTA) performance on qualitative benchmarks (CV-Bench, BLINK, RoboSpatial) and quantitative benchmarks (Q-Spatial++).
- Outperformed Gemini 2.5-Pro by 5.83% on average across benchmarks.
- Demonstrated the ability to solve complex multi-view and metric tasks (e.g., calculating exact distances between objects in different images).
Simulation (Open6DOR V2):
- In position tracking tasks, TIGeR achieved an 83.7% average success rate, significantly outperforming baselines like OpenVLA (40.2%) and SoFar (72.4%).
- The 3D point prediction approach mitigated occlusion issues common in 2D-based methods.
Real-World Robotics (UR5 Arm + L515 Camera):
- Tested on four precise manipulation tasks (e.g., "place object 0.1m to the right," "place behind an occluded object").
- Metric Precision: Achieved 55% success on metric-precision placement (vs. 0% for OpenVLA).
- Occlusion Handling: Successfully resolved tasks requiring placement "behind" or "above" occluded objects, achieving 60–70% success rates where baselines failed completely.

5. Significance

Bridging the Gap: TIGeR connects the perceptual capabilities of VLMs with the rigorous computational requirements of robotics.
Interpretability & Adaptability: By explicitly generating code and invoking tools, the reasoning process is transparent (interpretable) and adaptable. New tools can be integrated without retraining the entire model.
Scalability: The framework demonstrates that combining VLMs with external geometric libraries is a more robust path to embodied AI than attempting to force neural networks to learn complex 3D geometry from scratch.
Impact: This work paves the way for robots that can perform high-precision, metric-aware tasks in unstructured environments, moving beyond simple "reachability" to exact "manipulability."