LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

This paper proposes LVLM-COUNT, a divide-and-conquer baseline method that enhances large vision-language models' ability to count large numbers of objects by decomposing tasks and preventing object splitting to avoid repetitive counting.

Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis

Published 2026-02-17

The Problem: The "Overwhelmed Librarian"

Imagine you have a very smart, well-read librarian (this is the Large Vision-Language Model, or LVLM). This librarian can read a book, look at a picture, and tell you exactly what's happening in it. They are amazing at understanding complex stories.

However, ask this librarian, "How many red apples are in this basket?" and, if there are only three apples, they will get it right. But show them a basket with 300 apples, and they often get confused. They might say "298" or "315."

Why? Because our current AI models are great at recognizing things but terrible at counting large groups. It's like trying to count every single grain of sand on a beach while looking at the whole beach at once; your brain just gets overwhelmed.

The Solution: The "Divide and Conquer" Strategy

The authors of this paper realized that instead of asking the librarian to count the whole beach at once, we should give them a better strategy. They call their method LVLM-Count.

Think of it like this:

  1. The Naive Approach: You hand the librarian a photo of a crowded stadium and ask, "How many people are wearing blue hats?" They squint, get dizzy, and guess wrong.
  2. The LVLM-Count Approach: You take a pair of scissors and cut the photo into smaller, manageable pieces. You hand the librarian one small piece and ask, "How many blue hats are here?" Then you do the same for the next piece. Finally, you add up all the small answers to get the total.
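The cut-and-add strategy above is easy to sketch in code. This is just an illustration of the idea on a toy grid "image", not the paper's implementation; the tiling and the per-tile counter are stand-ins:

```python
def tile(image, rows, cols):
    """Split a 2-D grid image into rows x cols rectangular tiles
    (the naive straight-line cut)."""
    h, w = len(image), len(image[0])
    th, tw = h // rows, w // cols
    return [[row[c * tw:(c + 1) * tw] for row in image[r * th:(r + 1) * th]]
            for r in range(rows) for c in range(cols)]

def count_total(tiles, count_in_tile):
    """Add up the small answers from each tile to get the total."""
    return sum(count_in_tile(t) for t in tiles)

# Toy "photo": a 4x4 grid with three blue hats in it.
image = [["."] * 4 for _ in range(4)]
image[0][1] = image[2][3] = image[3][0] = "hat"

tiles = tile(image, 2, 2)
count = lambda t: sum(cell == "hat" for row in t for cell in row)
print(count_total(tiles, count))  # 3
```

Here the per-tile counter is a trivial stand-in; in LVLM-Count that role is played by the vision-language model answering "how many hats are here?" on each small piece.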

This sounds simple, but there's a huge trap in this plan.

The Trap: The "Sliced Pizza" Problem

If you just cut the photo with straight lines (like slicing a pizza), you might accidentally cut a person wearing a blue hat right in half.

  • The Mistake: The librarian looks at the left slice and sees a "half-hat." They look at the right slice and see another "half-hat." They might count them as two different people instead of one. Or, they might get confused and miss one entirely.

This is called double-counting or fragmentation.

The Secret Sauce: "Object-Aware" Cutting

This is where the paper's main innovation shines. The authors didn't just use straight scissors. They built a smart cutting machine.

  1. Spotting the Targets: First, the system uses a "spotter" (a detection model) to find exactly where the blue hats are.
  2. Drawing the Map: It creates a map where the hats are "obstacles" (like rocks in a river).
  3. The Pathfinding: The system then uses a pathfinding algorithm (like a GPS finding a route around traffic) to draw cutting lines. The lines are forced to go around the hats, never through them.
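To make the pathfinding idea concrete, here is a minimal sketch: detected objects become blocked cells on a grid, and a breadth-first search threads a top-to-bottom cutting line between them. This is an illustrative toy, not the authors' actual algorithm or grid resolution:

```python
from collections import deque

def find_cut_path(blocked, width, height):
    """Find a top-to-bottom cut that never enters a blocked (object)
    cell, moving down, down-left, or down-right at each step.
    Returns the list of (x, y) cells on the cut, or None."""
    starts = [(x, 0) for x in range(width) if (x, 0) not in blocked]
    queue = deque((s, [s]) for s in starts)
    seen = set(starts)
    while queue:
        (x, y), path = queue.popleft()
        if y == height - 1:          # reached the bottom edge
            return path
        for nx in (x - 1, x, x + 1):
            nxt = (nx, y + 1)
            if 0 <= nx < width and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # objects wall off the image completely

# Two "hats" occupy cells on row 2; the cut weaves around them.
blocked = {(1, 2), (2, 2), (4, 2), (5, 2)}
path = find_cut_path(blocked, width=7, height=5)
```

The key property is the one the analogy describes: the returned path never passes through a blocked cell, so no object is sliced in half.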

The Analogy: Imagine you are cutting a cake, but you have to avoid cutting through the cherries on top. You carefully maneuver your knife so it slices the cake between the cherries, ensuring every cherry stays whole on one piece of cake.

How It Works Step-by-Step

  1. Listen: The AI reads your question (e.g., "Count the brown eggs").
  2. Zoom In: It crops the image to only show the area with the eggs, ignoring the rest of the kitchen.
  3. Map the Eggs: It draws invisible outlines around every single egg.
  4. The Smart Cut: It draws a line through the image, but the line bends and twists to go between the eggs, never slicing one in half.
  5. Count & Add: It sends these smaller, safe pictures to the AI librarian. The librarian counts the eggs in each small picture easily. The system adds them all up for the final answer.
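The five steps above fit into one short pipeline. In this sketch every helper (detect, segment_aware, count) is a stand-in callable passed from outside; none of these names come from the paper's code, and the toy demo replaces real models with trivial functions:

```python
def lvlm_count_pipeline(image, query, detect, segment_aware, count):
    """High-level sketch of the five steps."""
    region = detect(image, query)            # steps 1-2: find & crop the relevant area
    sub_images = segment_aware(region)       # steps 3-4: object-aware cuts
    return sum(count(sub, query) for sub in sub_images)  # step 5: count & add

# Toy demo: an "image" is a grid of labels, the detector picks out
# matching cells, and the object-aware cutter just chunks them.
image = [["egg", "egg", "pan"], ["egg", "spoon", "egg"]]
detect = lambda img, q: [c for row in img for c in row if c == q]
chunk = lambda items: [items[i:i + 2] for i in range(0, len(items), 2)]
total = lvlm_count_pipeline(image, "egg", detect, chunk, lambda sub, q: len(sub))
print(total)  # 4
```

In the real system, the detector and the per-piece counter are heavy vision models; the pipeline's job is only to hand them problems small enough to solve reliably.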

Why This Matters

The paper tested this method on many difficult datasets:

  • Crowded scenes: Like a penguin colony where penguins are piled on top of each other.
  • Complex objects: Like counting specific emojis that look very similar (e.g., a clock showing 2:30 vs. 2:35).
  • Real-world jobs: Counting barrels in a warehouse or cells under a microscope.

The Result: The AI didn't just get slightly better; it got significantly better. It turned a confused guesser into a reliable counter, even for hundreds of objects.

The Takeaway

The paper teaches us that when AI struggles with big numbers, we don't necessarily need to make the AI "smarter" (which is hard and expensive). Instead, we can make the task easier for the AI by breaking the problem down and being careful not to break the objects we are counting.

It's the difference between asking a friend to count a pile of 500 coins in one go versus giving them a tray to sort them into piles of 10. The friend is the same, but the method makes the difference.
