OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models

The paper introduces OrdinalBench, a benchmark dataset and evaluation framework designed to expose the generalization limits of Vision-Language Models in ordinal number understanding and in sequential reasoning tasks involving large indices and complex paths.

Yusuke Tozaki, Hisashi Miyamori

Published 2026-03-10

Imagine you have a very smart robot friend who can look at a picture and tell you what's in it. It can spot a cat, a car, or a tree with amazing accuracy. But, if you ask it, "Can you walk through this maze and tell me which object is the 250th one you pass?" or "Skip every third object and tell me the name of the 50th one you see," this robot suddenly starts acting like it's never seen a maze before. It gets lost, loses count, or just guesses.

This is exactly what the paper ORDINALBENCH is about. It's a new "report card" designed to test how well Vision-Language Models (the AI brains behind these robots) can handle counting in order and following rules, rather than just recognizing things.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Smart but Clueless" Robot

Current AI models are like photographic memory geniuses who can't do math.

  • What they are good at: If you show them a picture of a room, they can say, "There is a red chair, a blue lamp, and a green plant." (This is called Cardinal Understanding: knowing what things are and how many of them there are.)
  • Where they fail: If you ask, "Walk clockwise starting from the red chair and tell me what the 15th item is," they get confused. They struggle to keep track of their place in a sequence, especially if the path is twisty (like a maze) or the number is huge (like 200 or 300).

2. The Solution: ORDINALBENCH (The "Obstacle Course")

The researchers built a giant, automated obstacle course called ORDINALBENCH. Instead of showing the AI real-world photos (which are messy and hard to control), they created thousands of synthetic, clean images that look like board games or mazes.

Think of it like a video game level generator that creates three types of challenges:

  • The Simple Loop: A circle of objects. Easy, right?
  • The Maze: A complex, winding path with no dead ends. You have to follow specific rules (e.g., "If you hit a wall, turn right").
  • The "Skip" Challenge: You have to count, but you skip every 3rd object. This forces the AI to do a little math while walking.

The difficulty is dialed up like a volume knob:

  • Small Numbers: "Find the 5th object."
  • Huge Numbers: "Find the 250th object."
  • Crowded Scenes: A maze with 400 objects vs. a maze with 20.
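To make these challenge types concrete, here is a minimal sketch of how a loop task with a skip rule could be generated with a known ground-truth answer. The function name, label set, and structure are illustrative assumptions, not the paper's actual generator:

```python
import random

def make_loop_task(num_objects, target_index, skip=1, seed=0):
    """Generate a toy OrdinalBench-style loop task.

    Objects sit on a closed loop; the ground-truth answer is the label
    of the `target_index`-th object counted along the loop, optionally
    counting only every `skip`-th object visited.
    (Illustrative sketch, not the paper's code.)
    """
    rng = random.Random(seed)
    labels = [rng.choice(["cat", "car", "tree", "lamp", "chair"])
              for _ in range(num_objects)]
    # Walk the loop, counting every `skip`-th visited object until we
    # reach the target ordinal position; wrap around with modulo.
    count, pos, visited = 0, -1, 0
    while count < target_index:
        pos = (pos + 1) % num_objects
        visited += 1
        if visited % skip == 0:  # only every skip-th object counts
            count += 1
    return labels, labels[pos]
```

Because the generator walks the loop itself, it always knows the correct answer, even for "Find the 250th object in a loop of 20" — exactly the kind of controlled difficulty knob the benchmark turns.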

3. The Test: Not Just "Right or Wrong"

Usually, when we test AI, we just ask, "What is the answer?" and check if it's right.
ORDINALBENCH is stricter. It demands the AI show its work.

Imagine a teacher grading a math test.

  • Old Way: The student writes "42." The teacher checks if 42 is right.
  • ORDINALBENCH Way: The student must write out every single step: "Step 1: I am at the start. Step 2: I turned right. Step 3: I counted one..."

The researchers use special metrics to grade this "step-by-step" trace:

  • Did they get the final answer right? (Final Accuracy)
  • Did they get lost halfway through? (Step Consistency)
  • Did they even try to follow the rules? (Trace Coverage)
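The three metrics above could be scored roughly along these lines. The formulas below are illustrative guesses (the paper's exact definitions aren't reproduced here), and `grade_trace` is an invented name:

```python
def grade_trace(predicted_trace, gold_trace, predicted_answer, gold_answer):
    """Toy scorer for a step-by-step counting trace.

    Final Accuracy:   did the model land on the right answer?
    Step Consistency: fraction of the gold path matched before the
                      first divergence (did it get lost halfway?).
    Trace Coverage:   how much of the gold path the model attempted.
    (Illustrative formulas, not the paper's exact metrics.)
    """
    final_accuracy = float(predicted_answer == gold_answer)

    # Count matching steps up to the first mistake.
    consistent = 0
    for p, g in zip(predicted_trace, gold_trace):
        if p != g:
            break
        consistent += 1
    n = len(gold_trace)
    step_consistency = consistent / n if n else 0.0
    trace_coverage = min(len(predicted_trace), n) / n if n else 0.0

    return {"final_accuracy": final_accuracy,
            "step_consistency": step_consistency,
            "trace_coverage": trace_coverage}
```

The point of grading this way is that a model can be "right for the wrong reasons": a lucky final answer with a broken trace scores high on Final Accuracy but low on Step Consistency.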

4. The Results: The "Crash"

The researchers tested top-tier AI models (like GPT-5, Gemini, and Qwen) on this course. The results were shocking:

  • The "Small Number" Trap: The models were okay with small numbers (like finding the 5th object).
  • The "Big Number" Wall: As soon as the number got big (over 100) or the path got twisty (a maze), the models' performance crashed. They dropped from being "smart" to "random guessers."
  • The "Skip" Confusion: When asked to skip-count (count only every 3rd item), the models almost completely failed. It's like asking someone to walk down a street and count only every third lamppost they pass; the AI simply forgot the rule entirely.

The Metaphor:
Imagine a marathon runner who is incredibly fast on a straight track (recognizing objects). But the moment you put a complex obstacle course in front of them and say, "Run, but stop and count every third step," they trip, forget the rules, and stop running.

5. Why Does This Happen?

The paper suggests two main reasons:

  1. The "Language Bottleneck": The AI tries to turn the picture into words in its head to solve the problem. But describing a complex maze in words is messy. It loses the "map" and gets confused.
  2. Lack of "Procedural Muscle": The AI is great at knowing facts (declarative knowledge) but bad at doing things step-by-step (procedural knowledge). It hasn't been trained to "hold a state" (remembering "I am currently on step 47") while moving through a visual scene.
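As a rough picture of what "holding a state" means, here is a toy wall-following walker that must track its position, heading, and move count at every step — the kind of procedural bookkeeping the paper argues VLMs lack. The grid format, rule set, and function name are invented for illustration:

```python
def wall_follow(grid, steps):
    """Walk a grid with explicit state, turning right at walls.

    `grid` is a list of strings where '#' is a wall and '.' is open.
    The walker must hold state (position, heading) at every move.
    (Toy illustration, not the benchmark's actual rule set.)
    """
    # Headings in clockwise order: up, right, down, left.
    dirs = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    r, c, heading = 1, 1, 1  # start just inside the grid, facing right
    trace = []
    for _ in range(steps):
        dr, dc = dirs[heading]
        nr, nc = r + dr, c + dc
        if grid[nr][nc] == '#':        # hit a wall: turn right, stay put
            heading = (heading + 1) % 4
        else:                          # open cell: step forward
            r, c = nr, nc
        trace.append((r, c))
    return trace
```

A few lines of code do this trivially; the paper's finding is that models which narrate the scene in words tend to drop exactly this running state ("I am currently on step 47") as paths get longer.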

6. The Takeaway

This paper isn't just saying "AI is bad at counting." It's saying: "If we want AI to be a true 'Visual Agent' that can navigate the real world (like a robot vacuum or a self-driving car), it needs to learn how to follow complex, multi-step rules, not just recognize pictures."

ORDINALBENCH is the new gym where these AI models can train to get stronger at this specific skill, so that one day they won't just "see" the world but will truly "navigate" it.