OrdinalBench: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models

The paper introduces OrdinalBench, a benchmark dataset and evaluation framework designed to expose the generalization limits of Vision-Language Models in ordinal number understanding and in sequential reasoning tasks involving large indices and complex paths.

Yusuke Tozaki, Hisashi Miyamori

Published 2026-03-10

Imagine you have a very smart robot friend who can look at a picture and tell you what's in it. It can spot a cat, a car, or a tree with amazing accuracy. But, if you ask it, "Can you walk through this maze and tell me which object is the 250th one you pass?" or "Skip every third object and tell me the name of the 50th one you see," this robot suddenly starts acting like it's never seen a maze before. It gets lost, loses count, or just guesses.

This is exactly what the paper ORDINALBENCH is about. It's a new "report card" designed to test how well Vision-Language Models (the AI brains behind these robots) can handle counting in order and following rules, rather than just recognizing things.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Smart but Clueless" Robot

Current AI models are like photographic memory geniuses who can't do math.

  • What they are good at: If you show them a picture of a room, they can say, "There is a red chair, a blue lamp, and a green plant." (This is called Cardinal Understanding: knowing what things are and how many of them there are.)
  • Where they fail: If you ask, "Walk clockwise starting from the red chair and tell me what the 15th item is," they get confused. They struggle to keep track of their place in a sequence, especially if the path is twisty (like a maze) or the number is huge (like 200 or 300).

2. The Solution: ORDINALBENCH (The "Obstacle Course")

The researchers built a giant, automated obstacle course called ORDINALBENCH. Instead of showing the AI real-world photos (which are messy and hard to control), they created thousands of synthetic, clean images that look like board games or mazes.

Think of it like a video game level generator that creates three types of challenges:

  • The Simple Loop: A circle of objects. Easy, right?
  • The Maze: A complex, winding path with no dead ends. You have to follow specific rules (e.g., "If you hit a wall, turn right").
  • The "Skip" Challenge: You have to count, but you skip every 3rd object. This forces the AI to do a little math while walking.

The difficulty is dialed up like a volume knob:

  • Small Numbers: "Find the 5th object."
  • Huge Numbers: "Find the 250th object."
  • Crowded Scenes: A maze with 400 objects vs. a maze with 20.
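To make these challenge types concrete, here is a minimal sketch of how a loop task with a skip rule could be generated with a known ground-truth answer. The function name, label set, and structure are illustrative assumptions, not the paper's actual generator:

```python
import random

def make_loop_task(num_objects, target_index, skip=1, seed=0):
    """Generate a toy OrdinalBench-style loop task.

    Objects sit on a closed loop; the ground-truth answer is the label
    of the `target_index`-th object counted along the loop, optionally
    counting only every `skip`-th object visited.
    (Illustrative sketch, not the paper's code.)
    """
    rng = random.Random(seed)
    labels = [rng.choice(["cat", "car", "tree", "lamp", "chair"])
              for _ in range(num_objects)]
    # Walk the loop, counting every `skip`-th visited object until we
    # reach the target ordinal position; wrap around with modulo.
    count, pos, visited = 0, -1, 0
    while count < target_index:
        pos = (pos + 1) % num_objects
        visited += 1
        if visited % skip == 0:  # only every skip-th object counts
            count += 1
    return labels, labels[pos]
```

Because the generator walks the loop itself, it always knows the correct answer, even for "Find the 250th object in a loop of 20" — exactly the kind of controlled difficulty knob the benchmark turns.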

3. The Test: Not Just "Right or Wrong"

Usually, when we test AI, we just ask, "What is the answer?" and check if it's right.
ORDINALBENCH is stricter. It demands the AI show its work.

Imagine a teacher grading a math test.

  • Old Way: The student writes "42." The teacher checks if 42 is right.
  • ORDINALBENCH Way: The student must write out every single step: "Step 1: I am at the start. Step 2: I turned right. Step 3: I counted one..."

The researchers use special metrics to grade this "step-by-step" trace:

  • Did they get the final answer right? (Final Accuracy)
  • Did they get lost halfway through? (Step Consistency)
  • Did they even try to follow the rules? (Trace Coverage)
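The three metrics above could be scored roughly along these lines. The formulas below are illustrative guesses (the paper's exact definitions aren't reproduced here), and `grade_trace` is an invented name:

```python
def grade_trace(predicted_trace, gold_trace, predicted_answer, gold_answer):
    """Toy scorer for a step-by-step counting trace.

    Final Accuracy:   did the model land on the right answer?
    Step Consistency: fraction of the gold path matched before the
                      first divergence (did it get lost halfway?).
    Trace Coverage:   how much of the gold path the model attempted.
    (Illustrative formulas, not the paper's exact metrics.)
    """
    final_accuracy = float(predicted_answer == gold_answer)

    # Count matching steps up to the first mistake.
    consistent = 0
    for p, g in zip(predicted_trace, gold_trace):
        if p != g:
            break
        consistent += 1
    n = len(gold_trace)
    step_consistency = consistent / n if n else 0.0
    trace_coverage = min(len(predicted_trace), n) / n if n else 0.0

    return {"final_accuracy": final_accuracy,
            "step_consistency": step_consistency,
            "trace_coverage": trace_coverage}
```

The point of grading this way is that a model can be "right for the wrong reasons": a lucky final answer with a broken trace scores high on Final Accuracy but low on Step Consistency.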

4. The Results: The "Crash"

The researchers tested top-tier AI models (like GPT-5, Gemini, and Qwen) on this course. The results were shocking:

  • The "Small Number" Trap: The models were okay with small numbers (like finding the 5th object).
  • The "Big Number" Wall: As soon as the number got big (over 100) or the path got twisty (a maze), the models' performance crashed. They dropped from being "smart" to "random guessers."
  • The "Skip" Confusion: When asked to skip-count (count only every 3rd item), the models almost completely failed. It's like asking someone to walk down a street and count only every third lamppost they pass; the AI simply forgot the rule entirely.

The Metaphor:
Imagine a marathon runner who is incredibly fast on a straight track (recognizing objects). But the moment you put a complex obstacle course in front of them and say, "Run, but stop and count every third step," they trip, forget the rules, and stop running.

5. Why Does This Happen?

The paper suggests two main reasons:

  1. The "Language Bottleneck": The AI tries to turn the picture into words in its head to solve the problem. But describing a complex maze in words is messy. It loses the "map" and gets confused.
  2. Lack of "Procedural Muscle": The AI is great at knowing facts (declarative knowledge) but bad at doing things step-by-step (procedural knowledge). It hasn't been trained to "hold a state" (remembering "I am currently on step 47") while moving through a visual scene.
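As a rough picture of what "holding a state" means, here is a toy wall-following walker that must track its position, heading, and move count at every step — the kind of procedural bookkeeping the paper argues VLMs lack. The grid format, rule set, and function name are invented for illustration:

```python
def wall_follow(grid, steps):
    """Walk a grid with explicit state, turning right at walls.

    `grid` is a list of strings where '#' is a wall and '.' is open.
    The walker must hold state (position, heading) at every move.
    (Toy illustration, not the benchmark's actual rule set.)
    """
    # Headings in clockwise order: up, right, down, left.
    dirs = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    r, c, heading = 1, 1, 1  # start just inside the grid, facing right
    trace = []
    for _ in range(steps):
        dr, dc = dirs[heading]
        nr, nc = r + dr, c + dc
        if grid[nr][nc] == '#':        # hit a wall: turn right, stay put
            heading = (heading + 1) % 4
        else:                          # open cell: step forward
            r, c = nr, nc
        trace.append((r, c))
    return trace
```

A few lines of code do this trivially; the paper's finding is that models which narrate the scene in words tend to drop exactly this running state ("I am currently on step 47") as paths get longer.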

6. The Takeaway

This paper isn't just saying "AI is bad at counting." It's saying: "If we want AI to be a true 'Visual Agent' that can navigate the real world (like a robot vacuum or a self-driving car), it needs to learn how to follow complex, multi-step rules, not just recognize pictures."

ORDINALBENCH is the new gym where these AI models can train to get stronger at this specific skill, so that one day they won't just "see" the world but will truly "navigate" it.