Imagine you hire a very smart, well-read assistant (an AI) to help you manage your money. You ask it, "What's the stock price of Apple right now?" or "Can you buy 10 shares for me?"
In the past, these assistants were like librarians. They could read a book about Apple and tell you what the price was last year. But they couldn't actually go to the stock market, check the live price, or press the "buy" button.
Now, we are trying to turn these assistants into active traders. We want them to use real tools (like a live stock ticker, a news feed, or a trading app) to get answers and take action.
The Problem:
Giving a super-smart AI a set of real financial tools is dangerous.
- If a human makes a mistake, they might lose a few dollars.
- If an AI makes a mistake, it might think "Apple" means "Apple Pie" and buy a bakery, or it might use yesterday's data to make a trade that loses you thousands.
- Current tests for these AIs are like video games. They use fake, made-up tools that never break and don't have real rules. They don't test if the AI understands that "buying" is different from "reading."
The Solution: FinToolBench
The authors of this paper built a giant, realistic training gym called FinToolBench. Think of it as a "Flight Simulator for Financial AIs," but instead of flying planes, they are managing money.
Here is how it works, using simple analogies:
1. The Gym Equipment (The Tools)
Instead of a few fake tools, they gathered 760 real, working financial tools.
- The Source: They grabbed free tools from the internet (like a massive library of stock charts, currency converters, and company reports).
- The Filter: They tested every single one to make sure it actually works. If a tool was broken or required a credit card you didn't have, they threw it out.
- The Result: A massive toolbox where every wrench and screwdriver is real and ready to use.
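The filtering step above can be pictured as a simple keep-or-discard loop. This is a hypothetical sketch, not the paper's actual pipeline: the tool records, the `requires_paid_key` flag, and the probe calls are all illustrative assumptions.

```python
# Illustrative sketch of the tool-filtering idea: call each candidate
# tool once, and keep it only if the call works and no paid key is needed.

def broken_probe():
    raise TimeoutError("endpoint unreachable")

def is_usable(tool):
    """Keep a tool only if a test call succeeds and it's free to use."""
    if tool["requires_paid_key"]:
        return False          # "required a credit card you didn't have"
    try:
        return tool["probe"]() is not None
    except Exception:
        return False          # broken endpoint -> throw it out

# Stubbed candidate tools (names and responses invented for the example)
candidates = [
    {"name": "live_stock_quote", "requires_paid_key": False,
     "probe": lambda: {"AAPL": 189.5}},
    {"name": "premium_screener", "requires_paid_key": True,
     "probe": lambda: {}},
    {"name": "dead_endpoint", "requires_paid_key": False,
     "probe": broken_probe},
]

toolbox = [t["name"] for t in candidates if is_usable(t)]
print(toolbox)  # only the working, free tool survives
```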
2. The Exam Questions (The Tasks)
They created 295 tricky questions that require using these tools.
- You can't just "guess" the answer from your memory.
- Example: "What is the exchange rate between the Euro and the Yen right now?" (You can't know this without a tool.)
- Example: "Find the latest safety report for this specific bank." (You need to search a database).
3. The "Three Golden Rules" (The New Scoring System)
This is the most important part. In the past, if an AI got the right answer, it got a gold star. In FinToolBench, getting the right answer isn't enough. The AI must follow three strict rules, or it fails:
- Rule #1: Freshness (Timeliness)
- Analogy: If you ask for the weather right now, and the AI shows you a forecast from 2010, it's wrong, even if the forecast was accurate back then.
- The Test: Did the AI use a tool that gives live data, or did it use an old, static file?
- Rule #2: Don't Overstep (Intent)
- Analogy: If you ask, "How much does a Ferrari cost?" (Information), the AI should not try to buy one (Transaction).
- The Test: Did the AI try to spend your money when you just wanted to read about it?
- Rule #3: Stay in Your Lane (Domain)
- Analogy: If you ask about the Bitcoin price, the AI shouldn't look up the price of a gold bar. They are different markets.
- The Test: Did the AI use the right "department" of tools for the specific question?
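The three rules above amount to three boolean checks on the (question, tool) pair the AI chose. Here is a toy sketch of that scoring idea; the field names (`live`, `kind`, `domain`) are assumptions made for illustration, and the benchmark's real checks are richer than this.

```python
# Toy version of the three-rule scoring: a right answer only counts
# if the tool used was fresh, in-scope, and matched the user's intent.

def passes_rules(query, tool):
    checks = {
        # Rule 1: Freshness -- a "right now" question needs a live-data tool
        "freshness": tool["live"] or not query["needs_live_data"],
        # Rule 2: Intent -- a read-only question must not trigger a transaction
        "intent": not (query["intent"] == "read"
                       and tool["kind"] == "transaction"),
        # Rule 3: Domain -- crypto questions go to crypto tools, not commodities
        "domain": tool["domain"] == query["domain"],
    }
    return all(checks.values()), checks

query = {"needs_live_data": True, "intent": "read", "domain": "crypto"}
good_tool = {"live": True, "kind": "information", "domain": "crypto"}
bad_tool = {"live": False, "kind": "transaction", "domain": "commodities"}

print(passes_rules(query, good_tool)[0])  # True
print(passes_rules(query, bad_tool)[0])   # False
```

Failing any one check fails the whole task, which is exactly the shift from "did it get the answer?" to "did it get the answer the right way?"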
4. The "Smart Assistant" (FATR)
The authors also built a basic "smart assistant" called FATR to show how to do this right.
- It's like giving the AI a checklist before it acts.
- Before it picks a tool, it asks: "Is this tool fresh enough? Is it safe? Is it for the right market?"
- This simple checklist helps the AI avoid silly mistakes and makes it much more reliable.
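To make the checklist concrete, here is a toy sketch of how an agent in this spirit might screen its tools before calling anything. The checklist items, field names, and tools below are assumptions for illustration, not FATR's actual implementation.

```python
# Toy tool-selection loop: run every candidate tool through the
# checklist first, and refuse to act if nothing passes.

def pick_tool(question, tools):
    """Return the first tool that clears the checklist, or abstain."""
    for tool in tools:
        fresh_enough = tool["live"] or not question["needs_live_data"]
        # Only allow a transaction tool when the user actually asked to trade
        safe = tool["kind"] != "transaction" or question["intent"] == "trade"
        right_market = tool["domain"] == question["domain"]
        if fresh_enough and safe and right_market:
            return tool["name"]
    return None  # better to refuse than to guess with the wrong tool

question = {"needs_live_data": True, "intent": "read", "domain": "stocks"}
tools = [
    # Stale data: fails the freshness check
    {"name": "yesterday_close_csv", "live": False,
     "kind": "information", "domain": "stocks"},
    # Spends money on a read-only question: fails the safety check
    {"name": "buy_order_api", "live": True,
     "kind": "transaction", "domain": "stocks"},
    # Live, read-only, right market: passes
    {"name": "live_ticker", "live": True,
     "kind": "information", "domain": "stocks"},
]

print(pick_tool(question, tools))  # live_ticker
```

Note the fallback: if no tool passes, the agent abstains rather than acting on stale data or overstepping into a trade.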
Why Does This Matter?
Imagine you are building a self-driving car. You don't just want to know if the car can drive in a parking lot (the old tests). You want to know if it can drive in a rainstorm, avoid a sudden pedestrian, and follow traffic laws (the new FinToolBench test).
FinToolBench is the first test that checks if AI agents can handle the real, messy, high-stakes world of finance without crashing the car. It ensures that when we let AI handle our money, it's not just "smart," but also safe, timely, and compliant.
The Bottom Line:
The authors are saying, "We can't trust AI with our money until we test it in a real financial gym with real rules. We built that gym, and we're opening the doors for everyone to test their AI agents."