Imagine you are hiring a very smart, well-read travel agent to plan a perfect 3-day trip to Philadelphia. You give them a list of very specific requests: "I want a hotel with great service," "I need restaurants with fresh food," and "I want to visit museums that are fun for kids."
In the past, we tested these AI agents (Large Language Models, or LLMs) by asking them simple questions like, "Which museum is open on Sundays?" or "What is the capital of France?" They were great at answering those.
But ItinBench is a new, tougher test. It asks the AI to do two things at once:
- The "Word Game" (Verbal Reasoning): Understand your complex, picky preferences and find the right places.
- The "Map Game" (Spatial Reasoning): Figure out how to drive between all those places without wasting time or gas.
Here is a breakdown of what the paper found, using some everyday analogies.
1. The Setup: The "Super-Travel Agent" Test
The researchers built a giant digital database of real restaurants, hotels, and attractions in Philadelphia. They created 500 different "customer requests" (like the one above) and asked various AI models (like GPT-4o, Llama, and Gemini) to generate a full 3-day itinerary.
The twist? The AI not only had to pick the right places; it also had to optimize the route between them. It's like asking a chef to not only pick the perfect ingredients for a meal but also to chop, cook, and plate them in the most efficient order so the food doesn't get cold.
2. The Two Skills Being Tested
The paper argues that human intelligence has two distinct gears, and current AIs struggle to shift between them smoothly:
- Verbal Reasoning (The Librarian): This is the AI's ability to read your text, understand that "fresh food" implies a good seafood spot, and find a hotel that matches "great service." It's like a librarian finding the right book based on a vague description.
- Spatial Reasoning (The Navigator): This is the ability to look at a map and realize, "Oh, if I go to the Zoo first, then the Museum, then the Hotel, I'm driving in circles. I should go to the Museum first because it's right next to the Zoo." It's like a GPS that actually thinks about the geometry of the city.
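The "driving in circles" problem above is really just a claim about visit order: the same set of stops can cost very different amounts of travel depending on the sequence. Here is a minimal sketch of that idea (the place names and coordinates are made up for illustration, not taken from the paper's Philadelphia database):

```python
from math import dist

# Hypothetical coordinates (x, y, in miles) for three stops.
places = {
    "Hotel":  (0.0, 0.0),
    "Zoo":    (4.0, 3.0),
    "Museum": (4.5, 2.5),  # right next to the Zoo
}

def route_length(order):
    """Total distance traveled when visiting stops in the given order."""
    stops = [places[name] for name in order]
    return sum(dist(a, b) for a, b in zip(stops, stops[1:]))

zigzag = route_length(["Hotel", "Zoo", "Hotel", "Museum"])  # backtracking
direct = route_length(["Hotel", "Zoo", "Museum"])           # neighbors back-to-back

print(f"zig-zag route: {zigzag:.1f} miles")
print(f"direct route:  {direct:.1f} miles")
```

Visiting the Museum right after the neighboring Zoo roughly cuts the trip to a third of the backtracking route, which is the kind of geometric shortcut the Navigator skill is supposed to spot.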
3. The Big Discovery: The "Juggling Act" Failure
The most surprising finding is that when you ask the AI to do both at the same time, it starts to drop the balls.
- The Analogy: Imagine a juggler who is amazing at juggling three red balls (words). If you ask them to juggle three red balls and three blue balls (spatial routes) at the same time, they don't just get slightly worse; they often drop the red balls entirely.
- The Result: When the AI was asked to optimize the route, its ability to follow your specific text preferences (like "I want French food") actually got worse. It got so focused on the map that it forgot your specific tastes.
4. The "Cheat Code" Discovery
The researchers found something interesting about how the AI does the "Map Game."
- The Problem: If you just say, "Plan a route," the AI often fails. It's like asking someone to navigate a city they've never seen just by looking at a list of street names.
- The "Cheat Code": When the researchers gave the AI a pre-made list of "clusters" (e.g., "These 5 places are all in the same neighborhood"), the AI got much better at the route.
- The Takeaway: The AI isn't actually "seeing" the map in its mind like a human does. It's more like a super-fast text processor. It's connecting the dots between words ("Cluster A is near Cluster B") rather than visualizing the geometry. It's solving a word puzzle, not a geometry problem.
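The cluster hint works because it turns a geometry problem into a grouping problem the model can handle as text: "finish neighborhood A before starting neighborhood B." A minimal sketch of why that helps (the stop names, cluster labels, and coordinates are all hypothetical, not the researchers' actual data or method):

```python
from math import dist

# Hypothetical stops tagged with a neighborhood "cluster" label,
# like the pre-made hint the researchers gave the AI.
stops = [
    ("Art Museum", "A", (0.0, 0.0)),
    ("Diner",      "B", (6.0, 0.0)),
    ("Zoo",        "A", (0.5, 0.5)),
    ("Bookshop",   "B", (6.5, 0.5)),
]

def route_length(ordered):
    """Total distance traveled when visiting stops in the given order."""
    coords = [xy for _, _, xy in ordered]
    return sum(dist(a, b) for a, b in zip(coords, coords[1:]))

# Naive order: exactly as listed, ping-ponging between neighborhoods.
naive = route_length(stops)

# Cluster-first order: finish all of neighborhood A before starting B.
clustered = route_length(sorted(stops, key=lambda s: s[1]))

print(f"naive route:     {naive:.1f} miles")
print(f"clustered route: {clustered:.1f} miles")
```

Grouping by label is pure text manipulation (a sort on a string), yet it recovers most of the savings that actual spatial insight would provide, which is exactly why the hint reads like a "cheat code."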
5. The Verdict: We Are Not There Yet
The paper concludes that while AI is getting incredibly smart at reading and writing, it still struggles with real-world planning where you have to balance logic, geography, and human preferences simultaneously.
- Current State: If you ask an AI to plan a trip, it might give you a great list of restaurants (Verbal win), but the route it suggests might involve driving 20 extra miles in circles (Spatial loss).
- The Future: To build a true "Travel Agent AI," we need to teach it to stop treating the map as just another list of words and start helping it understand space and distance in a way that feels more like human intuition.
In short: The paper introduces a new gym (ItinBench) where we can see that AI is a great reader but a clumsy navigator. To make them truly useful for real-life planning, we need to help them learn how to juggle both skills without dropping the ball.