CountEx: Fine-Grained Counting via Exemplars and Exclusion

CountEx is a discriminative visual counting framework that accepts multimodal inclusion and exclusion prompts. Its Discriminative Query Refinement module uses these prompts to explicitly suppress visually similar distractors, so that objects can be counted accurately in cluttered scenes. The approach is validated on the new CoCount benchmark.

Yifeng Huang, Gia Khanh Nguyen, Minh Hoai

Published 2026-02-24

Imagine you are at a busy party, and your friend asks you to count how many people are wearing red hats.

If you are in a hurry, you might accidentally count the people wearing pink or orange hats because they look so similar, and end up with a number that's too high. Most current computer vision systems face the same problem: they are good at finding "red things," but they struggle when you need to say, "Count the red hats, but ignore the pink ones."

This paper introduces a new system called CountEx that solves this problem. Here is how it works, explained simply:

1. The Problem: The "Confusing Crowd"

Current AI models are like a guest at the party who only listens to the first part of your sentence. If you say, "Count the red hats," they start counting everything red. If the room is full of red, pink, and orange hats, they get confused and count the wrong ones. They lack the ability to say, "Wait, I don't want the pink ones."

2. The Solution: The "Smart Filter" (CountEx)

CountEx is like a super-smart party guest who listens to the whole sentence. You can say, "Count the red hats, not the pink ones."

To do this, CountEx uses two main tools:

  • The "Yes" List: A description or a picture of what you want (e.g., "Red hats").
  • The "No" List: A description or a picture of what you don't want (e.g., "Pink hats").
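One way to picture the two lists is as a small data structure in which each prompt can carry a text description, example boxes drawn on the image, or both. The sketch below is only an illustration: the names Prompt, CountQuery, exemplar_boxes, and describe are hypothetical and are not taken from the CountEx code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels; a hypothetical box format.
Box = Tuple[int, int, int, int]


@dataclass
class Prompt:
    """One list: a text description, example boxes, or both."""
    text: Optional[str] = None
    exemplar_boxes: List[Box] = field(default_factory=list)

    def is_empty(self) -> bool:
        return self.text is None and not self.exemplar_boxes

    def label(self) -> str:
        if self.text is not None:
            return self.text
        return f"{len(self.exemplar_boxes)} exemplar box(es)"


@dataclass
class CountQuery:
    """A 'Yes' list plus an optional 'No' list."""
    include: Prompt
    exclude: Prompt = field(default_factory=Prompt)

    def describe(self) -> str:
        if self.exclude.is_empty():
            return f"count {self.include.label()}"
        return f"count {self.include.label()}, not {self.exclude.label()}"


query = CountQuery(
    include=Prompt(text="red hats"),
    exclude=Prompt(exemplar_boxes=[(40, 12, 88, 60)]),  # a picture of a pink hat
)
print(query.describe())
```

Keeping the "No" list optional means a query with no exclusion behaves like an ordinary counting request.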

3. How It Works: The "Sieve and Sponge" Analogy

The magic happens inside a special module called the Discriminative Query Refinement (DQR). Think of this process like a three-step kitchen recipe:

  • Step 1: The Common Ground (Shared Features)
    Imagine you have a bag of red hats and a bag of pink hats. First, CountEx looks at both bags and says, "Okay, these are both hats. They both have a brim and a crown." It creates a "common hat template." This ensures the AI doesn't forget that it's looking for hats at all.

  • Step 2: The Difference (Exclusive Features)
    Next, it looks at the "No" list (the pink hats) and asks, "What makes these specifically pink and not red?" It isolates the "pinkness" and puts it in a separate bucket. It ignores the "hat-ness" and focuses only on the "pinkness."

  • Step 3: The Filter (Selective Suppression)
    Now, it goes back to the "Yes" list (the red hats). It takes the "pinkness" bucket and uses it like a sieve or a sponge. It gently squeezes the "pink" features out of the red hats.

    • If a hat is truly red, the sponge doesn't soak it up.
    • If a hat is actually pink (but the AI thought it was red), the sponge soaks it up and removes it from the count.

The result? A clean list of only the red hats, with the look-alike pink ones filtered out.
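The three steps above can be sketched with toy numbers. This is not the actual DQR module: the hand-made three-number features, the element-wise minimum used for the shared part, and the ALPHA and THRESHOLD values are all assumptions chosen to make the idea visible.

```python
# Toy illustration only: hand-made features, not learned embeddings.
# Assumed feature layout: [hat-ness, redness, pinkness]
ALPHA = 1.0       # suppression strength (hypothetical)
THRESHOLD = 1.5   # minimum score for an object to be counted (hypothetical)


def shared_features(include, exclude):
    """Step 1: what both prompts have in common (element-wise minimum)."""
    return [min(a, b) for a, b in zip(include, exclude)]


def exclusive_features(exclude, shared):
    """Step 2: what remains of the 'No' prompt once the common part is removed."""
    return [e - s for e, s in zip(exclude, shared)]


def refine_query(include, exclusive, alpha=ALPHA):
    """Step 3: push the 'Yes' query away from the exclusive features."""
    return [i - alpha * x for i, x in zip(include, exclusive)]


def count_matches(query, objects, threshold=THRESHOLD):
    """Count objects whose dot product with the query reaches the threshold."""
    return sum(
        1 for obj in objects
        if sum(q * o for q, o in zip(query, obj)) >= threshold
    )


include = [1.0, 1.0, 0.0]   # "red hats"
exclude = [1.0, 0.6, 0.8]   # "pink hats" (pink still contains some red)

red_hat = [1.0, 1.0, 0.1]
pink_hat = [1.0, 0.6, 0.8]
scene = [red_hat] * 3 + [pink_hat] * 2

shared = shared_features(include, exclude)        # keeps the hat-ness
exclusive = exclusive_features(exclude, shared)   # isolates the pinkness
refined = refine_query(include, exclusive)

naive_count = count_matches(include, scene)    # uses the "Yes" list alone
refined_count = count_matches(refined, scene)  # uses "Yes" minus "No"
print(naive_count, refined_count)
```

In this toy scene the "Yes" list alone counts all five hats, while the refined query counts only the three red ones. Because only the pinkness is subtracted, the hat-ness that both lists share stays in the query.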

4. The New Playground: CoCount

To teach this system how to do this, the authors built a new training ground called CoCount.

  • Old Training Data: Like a classroom where every student wears a red shirt. The AI learned to count red shirts, but never had to deal with pink ones.
  • CoCount: Like a classroom with 97 different pairs of confusing twins (e.g., black screws vs. silver screws, straight pasta vs. curly pasta). The AI has to learn to tell them apart every single time.
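To see why paired data matters, consider a single made-up example. The record layout and the counts below are invented for illustration and do not reflect CoCount's actual file format or contents.

```python
from dataclasses import dataclass


@dataclass
class PairedSample:
    """A hypothetical image record containing two look-alike categories."""
    target: str
    distractor: str
    target_count: int
    distractor_count: int


def absolute_error(predicted: int, sample: PairedSample) -> int:
    """How far a predicted count is from the true target count."""
    return abs(predicted - sample.target_count)


# Invented numbers: 7 black screws mixed with 5 silver screws.
sample = PairedSample("black screws", "silver screws", 7, 5)

# A counter that cannot tell the twins apart counts every screw.
confused_prediction = sample.target_count + sample.distractor_count
# A counter that respects the exclusion counts only the target.
selective_prediction = sample.target_count

print(absolute_error(confused_prediction, sample))   # 5
print(absolute_error(selective_prediction, sample))  # 0
```

On an image that contains only one kind of object, both counters would score the same, which is why data without look-alike pairs cannot reveal the confusion.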

5. Why This Matters

Before this, if you asked an AI to count "white poker chips" in a pile of "blue poker chips," it would likely count the blue ones too, giving you a wrong answer.

With CountEx, you can be precise. You can say, "Count the white chips, not the blue ones," and the AI understands the difference. That kind of precision could help with tasks like:

  • Medical Imaging: Counting healthy cells but ignoring diseased ones that look similar.
  • Crowd Control: Counting people in red uniforms but ignoring people in blue uniforms.
  • Shopping: Counting specific types of fruit (like Granny Smith apples) while ignoring the Red Delicious ones in the same bin.

In short: CountEx gives AI the ability to say "No" as clearly as it says "Yes," making it much smarter at counting things in messy, complicated scenes.
