Imagine you hire a super-smart, highly trained personal assistant to help you with your bank account. You tell them, "I lost my wallet, please freeze my cards and check for fraud."
In the past, we tested these assistants by giving them a list of tools (like a "Freeze Card" button) and asking them to press the right one. But in the real world, things are messier. Your assistant doesn't just need a button; they need to know where the button is, when to press it, and what rules apply before they press it. They have to read a massive, messy library of 700 different documents (policies, FAQs, internal rules) on the fly, all while talking to you and trying to fix your problem.
This paper introduces a new test called τ-Knowledge (Tau-Knowledge) to see if AI agents can actually handle this real-world chaos.
Here is the breakdown using simple analogies:
1. The Problem: The "Blind" Librarian
Imagine your AI agent is a librarian who has been hired to find a specific book in a library with 700 books.
- Old Tests: We gave the librarian the book title and asked, "Can you find this?" or "Can you check out this book?" We tested finding the book and checking it out as two separate tasks.
- The Real World: In reality, the librarian doesn't know the book titles. They have to wander the aisles, read the spines of 700 books, figure out which one has the rules about "lost wallets," read the fine print, and then decide whether to freeze your card or cancel it. If they pick the wrong book, they might freeze the wrong card or break a rule.
2. The New Test: "τ-Banking"
The researchers built a fake bank called τ-Banking.
- The Library: It contains 700 documents about fake credit cards, savings accounts, and strict rules (e.g., "You can't close an account if you have a pending dispute").
- The Tools: The agent has tools to actually change the database (like "Freeze Card" or "Close Account"), but it doesn't know upfront that these tools exist. It has to find each tool's name inside the documents first.
- The Customer: A simulated human who might be vague ("I lost my stuff") or change their mind mid-conversation.
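The three pieces above can be sketched as a toy environment object. Everything here is an illustrative assumption — the document IDs, tool names, and policy text are made up for the sketch, not taken from the actual benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    title: str
    body: str  # policy text; may mention the name of an executable tool

@dataclass
class BankEnvironment:
    """Toy stand-in for a τ-Banking-style environment (all names made up)."""
    documents: dict[str, Document] = field(default_factory=dict)
    accounts: dict[str, dict] = field(default_factory=dict)

    def search_docs(self, query: str) -> list[str]:
        """Keyword search over the document corpus; returns matching doc IDs."""
        q = query.lower()
        return [d.doc_id for d in self.documents.values()
                if q in d.title.lower() or q in d.body.lower()]

    def read_doc(self, doc_id: str) -> str:
        return self.documents[doc_id].body

    def freeze_card(self, account_id: str) -> str:
        """A write tool the agent only learns about by reading the policy docs."""
        self.accounts[account_id]["card_frozen"] = True
        return f"Card on {account_id} frozen."

env = BankEnvironment()
env.documents["pol-017"] = Document(
    "pol-017", "Lost wallet policy",
    "If a customer reports a lost wallet, call freeze_card before anything else.")
env.accounts["acct-1"] = {"card_frozen": False}

hits = env.search_docs("lost wallet")   # the agent finds the right policy...
policy = env.read_doc(hits[0])          # ...reads it to discover freeze_card...
print(env.freeze_card("acct-1"))        # ...then executes the fix
```

The key design point: `freeze_card` never appears in any tool list handed to the agent; its name is buried in `pol-017`, so reading the right document is the only way to learn it exists.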
3. How the Test Works
The AI agent has to act like a real customer service rep:
- Listen to the user.
- Search the library (using tools like a search engine or a file explorer) to find the right policy.
- Read the policy to understand the rules.
- Find the specific tool needed to fix the problem.
- Execute the fix while following the rules.
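The five steps above amount to a retrieve-read-act loop. Here is a minimal sketch, with a stub environment standing in for the real one; the class, method, and tool names are assumptions for illustration, not the benchmark's API:

```python
import re

class StubEnv:
    """Minimal stand-in environment (all names here are made up)."""
    def __init__(self):
        self.docs = {"pol-017": "Lost wallet policy: call freeze_card first."}
        self.frozen = False
    def search_docs(self, query):
        return [doc_id for doc_id, body in self.docs.items() if "wallet" in query]
    def read_doc(self, doc_id):
        return self.docs[doc_id]
    def freeze_card(self):
        self.frozen = True
        return "card frozen"

def run_agent(user_message, env, max_steps=5):
    """Toy loop mirroring the five steps: listen, search, read, find tool, act."""
    query = user_message.lower()                 # 1. listen to the user
    for _ in range(max_steps):
        hits = env.search_docs(query)            # 2. search the library
        if not hits:
            return "escalate: no relevant policy found"
        policy = env.read_doc(hits[0])           # 3. read the policy fine print
        match = re.search(r"\b(freeze_card)\b", policy)  # 4. discover the tool
        if match is None:
            query += " policy"                   # off-topic hit; refine and retry
            continue
        return getattr(env, match.group(1))()    # 5. execute the fix
    return "escalate: step budget exhausted"

env = StubEnv()
result = run_agent("I lost my wallet", env)
print(result)  # → card frozen
```

A real agent would use an LLM for steps 1, 3, and 4 rather than a keyword match and a regex, but the control flow — and the ways it can fail at each step — is the same.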
4. The Results: The "Smart" Agents Are Still Stuck
The researchers tested the world's smartest AI models (like GPT-5.2 and Claude-4.5). The results were surprising and a bit scary:
- The Success Rate is Low: Even the best AI only succeeded about 25% of the time on the first try. That means 3 out of 4 times, they failed to help the customer correctly.
- The "Golden" Test: They tried giving the AI the exact right documents immediately (removing the search part). Even then, the success rate only jumped to about 40%.
- Analogy: It's like giving a chef the exact recipe and ingredients, but they still burn the meal because they don't understand the cooking instructions. The problem isn't just finding the info; it's reasoning with it.
- The "Search" Trap: Some AIs tried to search too much. They would read 20 documents, get confused, search again, and waste time. Others would guess the answer without reading enough.
5. Why This Matters
The paper argues that we need to stop testing AI agents just on how well they "search" or how well they "follow instructions." We need to test them on efficiency and reliability in a messy, real-world setting.
- Efficiency: If an AI takes 10 minutes to solve a problem that should take 2 minutes, the customer gets frustrated.
- Reliability: If an AI freezes your card when it shouldn't, or misses a fraud alert, that's a disaster.
The Big Takeaway
Think of current AI agents as brilliant students who are terrible at following a map. They can read a book, but if you ask them to navigate a complex city (the bank's database) while talking to a confused tourist (the user), they often get lost, pick the wrong bus, or forget the rules of the road.
τ-Knowledge is the new driving test that forces these AIs to prove they can actually drive the car, not just recite the driver's manual. The results show that while the "students" are smart, they still have a long way to go before they can be trusted to drive us around alone.