CountEx: Fine-Grained Counting via Exemplars and Exclusion

Imagine you are at a busy party, and your friend asks you to count how many people are wearing red hats.

If you rush, you might accidentally count the people wearing pink or orange hats because they look so similar, and you'd end up with a number that's too high. Most current computer vision systems face the same problem: they are great at finding "red things," but they struggle when you need to say, "Count the red hats, but ignore the pink ones."

This paper introduces a new system called CountEx that is designed to solve this problem. Here is how it works, explained simply:

1. The Problem: The "Confusing Crowd"

Current AI models are like a guest at the party who only listens to the first part of your sentence. If you say, "Count the red hats," they start counting everything red. If the room is full of red, pink, and orange hats, they get confused and count the wrong ones. They lack the ability to say, "Wait, I don't want the pink ones."

2. The Solution: The "Smart Filter" (CountEx)

CountEx is like a super-smart party guest who listens to the whole sentence. You can say, "Count the red hats, not the pink ones."

To do this, CountEx uses two main tools:

The "Yes" List: A description or a picture of what you want (e.g., "Red hats").
The "No" List: A description or a picture of what you don't want (e.g., "Pink hats").

3. How It Works: The "Sieve and Sponge" Analogy

The magic happens inside a special module called the Discriminative Query Refinement (DQR). Think of this process like a three-step kitchen recipe:

Step 1: The Common Ground (Shared Features)
Imagine you have a bag of red hats and a bag of pink hats. First, CountEx looks at both bags and says, "Okay, these are both hats. They both have a brim and a crown." It creates a "common hat template." This ensures the AI doesn't forget that it's looking for hats at all.
Step 2: The Difference (Exclusive Features)
Next, it looks at the "No" list (the pink hats) and asks, "What makes these specifically pink and not red?" It sets the shared "hat-ness" aside and puts only the "pinkness" in a separate bucket.
Step 3: The Filter (Selective Suppression)
Now, it goes back to the "Yes" list (the red hats). It takes the "pinkness" bucket and uses it like a sponge, soaking up anything in its search for red hats that looks specifically pink.
- If a hat is truly red, the sponge doesn't soak it up.
- If a hat is actually pink (but the AI thought it was red), the sponge soaks it up and removes it from the count.

The result? A much cleaner list of red hats, with the pink ones filtered out.

4. The New Playground: CoCount

To teach this system how to do this, the authors built a new training ground called CoCount.

Old Training Data: Like a classroom where every student wears a red shirt. The AI learned to count red shirts but never had to deal with pink ones.
CoCount: Like a classroom with 97 different pairs of confusing twins (e.g., black screws vs. silver screws, straight pasta vs. curly pasta). The AI has to learn to tell them apart every single time.

5. Why This Matters

Before this, if you asked an AI to count "white poker chips" in a pile of "blue poker chips," it would likely count the blue ones too, giving you a wrong answer.

With CountEx, you can be precise. You can say, "Count the white chips, not the blue ones," and the AI is built to act on that difference. That precision could help with tasks like:

Medical Imaging: Counting healthy cells but ignoring diseased ones that look similar.
Crowd Control: Counting people in red uniforms but ignoring people in blue uniforms.
Shopping: Counting specific types of fruit (like Granny Smith apples) while ignoring the Red Delicious ones in the same bin.

In short: CountEx gives AI the ability to say "No" as clearly as it says "Yes," making it much smarter at counting things in messy, complicated scenes.

1. Problem Statement

Visual object counting is a fundamental computer vision task, but existing methods struggle with fine-grained counting in cluttered scenes containing multiple co-existing object categories.

Limitation of Current Methods: Most state-of-the-art (SOTA) prompt-based methods (e.g., CLIP-Count, CountGD) rely on inclusion prompts (specifying what to count). They lack mechanisms to explicitly specify exclusion (what to ignore).
The Core Challenge: In scenes with visually similar distractors (e.g., "penne pasta" vs. "spiral pasta," or "black coins" vs. "silver coins"), models often misinterpret user intent, overcounting confusable categories or defaulting to the dominant class.
Naive Solutions Fail: Simply subtracting the count of negative examples from positive ones is ineffective because it ignores the relational context and shared visual features between the target and distractor classes.

2. Methodology: CountEx Framework

The authors propose CountEx, a discriminative visual counting framework that jointly reasons over inclusion and exclusion cues using multimodal prompts (natural language and optional visual exemplars).

A. Core Architecture

CountEx builds upon a query-based open-vocabulary detector (specifically LLMDet). It processes the input image $I$ along with:

Positive Prompt ($T_{pos}, E_{pos}$): Text and/or bounding boxes of objects to count.
Negative Prompt ($T_{neg}, E_{neg}$): Text and/or bounding boxes of objects to exclude.

The framework generates two separate query sets:

$Q_{pos}$: Encoded from positive prompts.
$Q_{neg}$: Encoded from negative prompts.
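
As a rough illustration of the inputs described above, the positive and negative prompts can be modeled as two small records, each holding optional text and optional exemplar boxes. The class names `Prompt` and `CountRequest`, the box format, and the choice to let the negative prompt default to empty are all assumptions made for this sketch; the summary does not describe the actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Prompt:
    """One side of the request: text T and/or exemplar boxes E."""
    text: Optional[str] = None
    exemplars: List[Box] = field(default_factory=list)

    def is_empty(self) -> bool:
        return self.text is None and not self.exemplars


@dataclass
class CountRequest:
    """Hypothetical container pairing (T_pos, E_pos) with (T_neg, E_neg)."""
    positive: Prompt
    # Assumption: an omitted negative prompt means "exclude nothing".
    negative: Prompt = field(default_factory=Prompt)

    def __post_init__(self) -> None:
        if self.positive.is_empty():
            raise ValueError("positive prompt needs text or at least one exemplar box")


request = CountRequest(
    positive=Prompt(text="penne pasta", exemplars=[(12.0, 30.0, 48.0, 66.0)]),
    negative=Prompt(text="spiral pasta"),
)
print(request.negative.is_empty())  # False
```

Each side of such a request would then be encoded into its own query set, $Q_{pos}$ or $Q_{neg}$.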

B. Discriminative Query Refinement (DQR) Module

The novel core of CountEx is the DQR module, which refines $Q_{pos}$ by suppressing distractor patterns without losing target features. It operates in three stages:

Shared Feature Identification:
- The model learns $r$ prototype embeddings ($C$) via cross-attention over the concatenated $[Q_{pos}; Q_{neg}]$.
- These prototypes capture visual attributes common to both categories (e.g., shape, texture of "pasta" or "coins").
- Loss Functions: A shareability loss ensures prototypes match both sets, while a diversity loss prevents prototype collapse.
Exclusive Feature Extraction:
- The model identifies the queries in $Q_{neg}$ that are most distant from the shared feature space (high exclusivity).
- It projects these queries onto the shared subspace and keeps the residuals, which represent features unique to the negative class (e.g., specific color or shape differences).
- This creates a compact set of Negative-Exclusive References ($R_{neg}$).
Selective Query Refinement:
- $Q_{pos}$ is refined using cross-attention against $R_{neg}$.
- A gated residual connection allows the model to adaptively suppress features in $Q_{pos}$ that align with $R_{neg}$.
- Result: The refined queries ($\tilde{Q}_{pos}$) retain the target object's identity while removing patterns specific to the distractor, enabling precise counting.
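
The three stages can be sketched numerically with NumPy. This is a minimal sketch under several assumptions, not the authors' implementation: attention is single-head with no learned projections, the random `proto_seeds` stand in for learned prototype parameters, exclusivity is measured as the residual norm, the gate is a single scalar, and suppression is written as a subtraction. The shareability and diversity losses are left out because their exact form is not given in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def dqr_sketch(q_pos, q_neg, proto_seeds, k=2, gate_logit=0.0):
    # Stage 1: prototypes C attend over the concatenated [Q_pos; Q_neg].
    q_all = np.concatenate([q_pos, q_neg], axis=0)
    prototypes = cross_attention(proto_seeds, q_all, q_all)   # (r, d)

    # Stage 2: project Q_neg onto span(C) and keep the residuals.
    basis, _ = np.linalg.qr(prototypes.T)                     # (d, r), orthonormal columns
    residuals = q_neg - (q_neg @ basis) @ basis.T             # (n_neg, d)
    exclusivity = np.linalg.norm(residuals, axis=1)           # assumed exclusivity score
    top = np.argsort(exclusivity)[-k:]                        # k most exclusive queries
    r_neg = residuals[top]                                    # (k, d)

    # Stage 3: gated residual update of Q_pos against R_neg.
    attended = cross_attention(q_pos, r_neg, r_neg)           # (n_pos, d)
    gate = 1.0 / (1.0 + np.exp(-gate_logit))                  # scalar gate (assumption)
    q_refined = q_pos - gate * attended                       # subtraction is an assumption
    return prototypes, r_neg, q_refined


rng = np.random.default_rng(0)
q_pos = rng.normal(size=(5, 8))    # 5 positive queries, dimension 8
q_neg = rng.normal(size=(4, 8))    # 4 negative queries
seeds = rng.normal(size=(3, 8))    # r = 3 prototype seeds
prototypes, r_neg, q_refined = dqr_sketch(q_pos, q_neg, seeds)
print(q_refined.shape)  # (5, 8)
```

Because the residuals are orthogonal to every prototype by construction, $R_{neg}$ in this sketch carries no component along the shared directions, which is the property that separates this design from subtracting the negative queries directly.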

C. Training Objective

The model is trained end-to-end with a multi-component loss:

Classification & Localization: Standard focal loss for classification and L1 loss for bounding box/point regression.
Density Prediction: A dense supervision branch using pseudo ground-truth density maps (Gaussian kernels) to improve spatial awareness.
Prototype Learning: The losses described in the DQR module ($L_{share}$ and $L_{div}$).
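
To make the density term concrete, the sketch below builds a pseudo ground-truth density map by placing one normalized Gaussian at each annotated point, so that the map sums to the object count. It then combines the loss terms as a weighted sum. The kernel width `sigma`, the per-kernel normalization, the function names, and the weighted-sum form with unit weights are assumptions for illustration; the summary does not give the actual values or weighting.

```python
import math


def gaussian_density_map(points, height, width, sigma=2.0):
    """Sum one normalized Gaussian per point; the map sums to len(points)."""
    dmap = [[0.0] * width for _ in range(height)]
    for cx, cy in points:
        kernel = [
            [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        mass = sum(sum(row) for row in kernel)  # normalize so each object adds exactly 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / mass
    return dmap


def total_loss(terms, weights):
    """Weighted sum of the loss components (the weighting scheme is an assumption)."""
    return sum(weights[name] * value for name, value in terms.items())


points = [(5, 5), (20, 8), (12, 25)]  # (x, y) centers of three annotated objects
dmap = gaussian_density_map(points, height=32, width=32)
print(round(sum(sum(row) for row in dmap), 6))  # 3.0

# Placeholder loss values, only to show how the terms combine.
terms = {"cls": 0.8, "loc": 0.3, "density": 0.1, "share": 0.05, "div": 0.02}
weights = {name: 1.0 for name in terms}
loss = total_loss(terms, weights)
```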

3. Key Contributions

Task Formulation: The paper formulates visual counting with explicit exclusion cues, allowing users to specify both what to count and what to ignore via text and visual exemplars.
CountEx Architecture: A novel framework featuring the Discriminative Query Refinement (DQR) module, which effectively isolates negative-exclusive features to refine counting queries without naive subtraction.
CoCount Dataset: A new benchmark designed specifically for fine-grained counting with exclusion.
- Scale: 1,780 videos and 10,086 annotated frames.
- Content: 97 category pairs (both inter-category and intra-category variants, e.g., "black vs. white peppercorns").
- Design: Includes distractor objects and controlled count variations to test reasoning over inclusion/exclusion intent.

4. Experimental Results

The authors evaluated CountEx on CoCount and several external benchmarks:

CoCount Performance:
- Novel-Category Setting (Zero-shot): CountEx achieved 26.61 MAE, a 19.9% error reduction relative to the base LLMDet (33.22 MAE).
- Known-Category Setting: CountEx achieved 12.72 MAE, outperforming the best baseline (CountGD) by 18%.
- Ablation Studies: Demonstrated that negative text prompts significantly reduce error (e.g., from 32.22 to 26.67 MAE in the novel-category setting) and that the DQR module's loss components are critical for performance.
Generalization (Zero-shot Transfer):
- LOOKALIKES Benchmark: CountEx achieved 18.53 MAE, setting a new SOTA for zero-shot methods. It significantly outperformed CountGD (22.34) and GroundingDINO (33.89). Notably, it outperformed methods requiring per-category synthetic data generation and test-time adaptation.
- PairTally Benchmark: CountEx achieved the best performance across all metrics (MAE and NAE), surpassing both pre-trained specialist counters and general vision-language models.
- FSC-147: When fine-tuned for single-category counting, CountEx achieved competitive results (8.63 MAE), slightly trailing the SOTA CountGD but outperforming many recent methods.
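
For readers who want to check the arithmetic behind these figures, the sketch below defines the two metrics and the relative error reduction. MAE is the mean absolute difference between predicted and true counts. NAE is taken here as the mean absolute error divided by the true count per image, which is one common definition; the exact normalization used in the paper is not restated in this summary. The toy counts are made up for illustration.

```python
def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def nae(pred, gt):
    """Mean of per-image absolute error divided by the true count (assumed definition)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers the baseline error."""
    return (baseline - ours) / baseline


# Toy counts, not taken from the paper.
print(mae([10, 12], [8, 15]))  # 2.5

# Novel-category MAE values reported above: LLMDet 33.22, CountEx 26.61.
print(f"{relative_reduction(33.22, 26.61):.1%}")  # 19.9%
```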

5. Significance and Impact

Bridging the Gap: CountEx addresses a critical gap in visual reasoning: the ability to handle negative constraints in dense, cluttered scenes. This moves counting from simple "find and count" to "find, distinguish, and count."
User Control: By allowing explicit exclusion, the framework offers greater control and reduces ambiguity for end-users, making it more practical for real-world applications like inventory management or medical imaging where specific subtypes matter.
Dataset Contribution: The introduction of CoCount provides the first large-scale, systematic benchmark for evaluating fine-grained counting with exclusion, addressing the bias in previous datasets that often ignored negative prompts.
Efficiency: Unlike competing methods that require synthetic data generation or per-category adaptation (taking minutes per class), CountEx enables real-time interactivity with direct negative prompt specification at inference time.

Limitations: The authors note that the model currently exhibits "positive text dominance" (over-reliance on text descriptions) and struggles with vague prompts due to the limitations of the underlying BERT-based language encoder in inferring visual details from abstract descriptions.