Imagine you are at a busy party. You look around and instantly know there are about 50 people in the room. You don't need to know their names, what they do for a living, or even if you've seen them before. You just see the shapes, the repeating patterns, and how their heads and bodies fit together to form a whole person.
Now, imagine asking a computer to do the same thing. If you show it a picture of a crowd of people, it usually does fine. But if you show it a picture of 50 pairs of sunglasses, or a pile of identical Lego bricks, the computer often gets confused. It might count every single lens of the glasses as a separate person, or count every Lego brick as a whole toy. It sees the parts but misses the whole.
This is the problem the paper "CountFormer" tries to solve.
The Problem: The Computer's "Zoom-In" Trouble
Most computer vision models are like a person who only knows how to count by looking at a specific type of object. If you ask them to count "cars," they are great. But if you ask them to count "glasses" without showing them a sample first, they panic.
They tend to get confused at the "part level."
- The Glasses Mistake: A computer might look at a pair of sunglasses and think, "I see two round shapes! That's two objects!" It forgets that those two shapes are connected by a bridge and are actually just one pair of glasses.
- The Lego Mistake: In a pile of tiny, identical blocks, the computer might count every single bump on the block as a separate item.
The Solution: CountFormer
The authors built a new tool called CountFormer. Think of it as giving the computer a pair of "smart glasses" that help it understand structure and repetition, rather than just memorizing what things look like.
Here is how they did it, using some simple analogies:
1. The "Super-Reader" (DINOv2)
Usually, computers are trained to recognize specific things (like "cat" or "dog"). The authors decided to use a pre-trained "foundation model" called DINOv2.
- Analogy: Imagine a student who has read every book in the library but hasn't been tested on specific questions yet. This student understands the flow of language, the structure of sentences, and how words relate to each other, even if they haven't seen the specific story you are asking about.
- In the paper: DINOv2 is this "super-reader." It looks at an image and understands the visual "grammar"—how parts fit together to make a whole—without needing to be told what the object is called.
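The "super-reader" idea can be sketched in code. This is a toy illustration with made-up numbers, not the paper's implementation: self-supervised models like DINOv2 turn each image patch into a feature vector, and patches that belong to the same kind of structure end up close together in that vector space.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors (1 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented "patch features" for a pair of sunglasses: two lens patches
# and one structurally different bridge patch.
left_lens = np.array([0.9, 0.1, 0.0])
right_lens = np.array([0.85, 0.15, 0.0])
bridge = np.array([0.1, 0.9, 0.2])

# The two lens patches look alike to the model...
print(cosine(left_lens, right_lens))  # close to 1
# ...while the bridge patch looks different.
print(cosine(left_lens, bridge))      # much lower
```

This is why a model built on such features can notice "these two round shapes are the same kind of part" without ever being told the word "lens."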
2. The "Map Coordinates" (Positional Embeddings)
The "super-reader" is great at understanding what things are, but sometimes it forgets where they are exactly.
- Analogy: Imagine you are describing a room to someone over the phone. You say, "There's a chair." But you don't say where it is. The listener gets confused. Now, imagine you add, "The chair is in the top-left corner." Suddenly, it makes sense.
- In the paper: The authors added "positional embeddings." This is like giving the computer a GPS coordinate for every part of the image. It ensures the computer knows that the two lenses of the sunglasses are right next to each other, connected by a bridge, rather than being two random floating circles.
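Here is a minimal sketch of what "GPS coordinates for every patch" can look like in practice. This is one common scheme (2D sinusoidal positional embeddings); the paper's exact formulation may differ. Each grid cell gets a unique vector encoding its (row, col) position, which is added to the visual features.

```python
import numpy as np

def positional_embedding_2d(rows, cols, dim):
    """Return a (rows, cols, dim) grid of sinusoidal position codes.
    Half the channels encode the row index, half encode the column index."""
    assert dim % 4 == 0
    half = dim // 2
    freqs = 1.0 / (10000 ** (np.arange(0, half, 2) / half))
    pe = np.zeros((rows, cols, dim))
    r = np.arange(rows)[:, None] * freqs[None, :]
    c = np.arange(cols)[:, None] * freqs[None, :]
    pe[:, :, 0:half:2] = np.sin(r)[:, None, :]   # row -> sine channels
    pe[:, :, 1:half:2] = np.cos(r)[:, None, :]   # row -> cosine channels
    pe[:, :, half::2] = np.sin(c)[None, :, :]    # col -> sine channels
    pe[:, :, half + 1::2] = np.cos(c)[None, :, :]  # col -> cosine channels
    return pe

pe = positional_embedding_2d(8, 8, 16)
features = np.random.randn(8, 8, 16)  # stand-in for DINOv2 patch features
features_with_pos = features + pe     # now each patch carries its "GPS"
```

Because every position gets a different code, the model can tell that the two lens patches are neighbors rather than two random floating circles.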
3. The "Density Map" (The Final Count)
Instead of trying to draw a box around every single object (which is hard when objects are crowded), the model creates a "heat map."
- Analogy: Imagine sprinkling flour over a table of cookies. Where there are cookies, the flour piles up high. Where there are no cookies, it's flat. If you weigh the total amount of flour, you can figure out how many cookies there are without even counting them one by one.
- In the paper: The model creates a "density map." It paints a picture where the "hot" spots represent objects. The computer then adds up all the "heat" to get the final number.
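The "flour on cookies" trick can be shown in a few lines. This is a toy sketch of the standard density-map idea, not the paper's code: place a normalized Gaussian "blob" at each object center, and the sum of the whole map recovers the count.

```python
import numpy as np

def density_map(centers, size=64, sigma=2.0):
    """Build a density map with one Gaussian blob per object center."""
    yy, xx = np.mgrid[0:size, 0:size]
    dmap = np.zeros((size, size))
    for (cy, cx) in centers:
        blob = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
        dmap += blob / blob.sum()  # each blob integrates to exactly 1
    return dmap

centers = [(16, 16), (16, 48), (48, 32)]  # three "cookies" on the table
dmap = density_map(centers)
print(round(dmap.sum()))  # → 3: weighing the "flour" recovers the count
```

Summing a map is much more forgiving in crowded scenes than drawing a box around each object, because overlapping blobs still add up to the right total.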
Did It Work?
The authors tested this on a dataset called FSC-147, which has thousands of images of weird, random objects (from birds to pens to Lego).
- The Result: The model didn't win every benchmark. On the headline error numbers, it was roughly on par with other top models.
- The Real Win: When they looked at the pictures, CountFormer made fewer "silly" mistakes.
- The Glasses Test: When shown a picture of glasses, other models counted the lenses separately (getting the number wrong). CountFormer saw the whole pair and got it right.
- The "Why": Because it understood the structure (the bridge connecting the lenses) thanks to the "Super-Reader" and the "GPS coordinates."
The Catch (The "Dense Crowd" Problem)
The paper admits one big weakness. If you show the computer a picture of a million tiny Lego bricks packed so tight you can't see the gaps, the model still gets confused.
- Analogy: If you pour a bucket of sand and ask someone to count the grains, even a smart person will struggle. The computer struggles here too because the "grains" blend together.
- The Insight: The authors found that a few of these "super crowded" pictures were inflating the average error scores, because an average is very sensitive to a handful of huge mistakes. If you remove those 4 hardest pictures, the model's numbers look much better. This tells us that the model is actually quite good, but the test is very harsh on crowded scenes.
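A few lines of arithmetic make this "harsh average" effect concrete. The numbers below are invented for illustration, not taken from the paper; the point is how Mean Absolute Error (MAE) behaves when a handful of images go badly wrong.

```python
# Hypothetical per-image errors |predicted count - true count|:
# eight easy images, plus four extremely dense ones.
errors = [2, 3, 1, 4, 2, 2, 3, 1, 350, 410, 520, 480]

mae_all = sum(errors) / len(errors)
trimmed = sorted(errors)[:-4]  # drop the 4 hardest images
mae_trimmed = sum(trimmed) / len(trimmed)

print(mae_all)      # dominated by the four huge errors
print(mae_trimmed)  # tiny: the model does well on typical images
```

A mean over a dozen images can be a hundred times larger than the typical per-image error, which is exactly why the authors report what happens when the densest outliers are set aside.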
The Big Takeaway
The main lesson of this paper isn't just "we built a better counter." It's that how a computer "sees" matters more than the math it uses to count.
By giving the computer a better way to understand visual structure (using DINOv2) and spatial location (using position maps), they made it smarter at counting things it has never seen before. It's a step toward machines that can look at a pile of weird junk and say, "Ah, I see 12 distinct items," just like a human would, without needing a manual or a sample.