Counting Through Occlusion: Framework for Open World Amodal Counting

Imagine you are standing in a crowded room trying to count how many people are there.

The Problem: The "Peek-a-Boo" Failure
Most current computer programs are like a child playing peek-a-boo. If they can see a person's face, they count them. But if a tall person stands in front of them, blocking the view, the computer thinks, "Oh, that person isn't there anymore!" It only counts what is strictly visible. It lacks the imagination to say, "Wait, I know someone is standing behind that tall person because I can see their shoes."

In the world of AI, this is called occlusion. When objects hide behind other things, standard counting AI fails miserably. It gets confused by the "blocking" object (the tall person) and forgets the hidden one.

The Solution: CountOCC (The "Imaginative Detective")
The authors of this paper created a new system called CountOCC. Think of CountOCC not just as a camera, but as an imaginative detective.

Instead of just looking at what's visible, CountOCC has two special superpowers:

1. The "Feature Reconstruction" Module (The 3D Printer)

Imagine you have a broken puzzle piece. A normal computer tries to count the puzzle by looking at the jagged, broken edges. It gets confused.

CountOCC, however, looks at the broken piece and says, "I know what the whole piece looks like." It uses clues from the visible parts (the puzzle piece you can see) and combines them with a "mental blueprint" (learned from text descriptions and other examples) to reconstruct the missing part of the object in its mind.

The Analogy: It's like seeing a car parked behind a fence. A normal AI sees a fence and a bumper. CountOCC uses the bumper and its knowledge of what cars look like to "print" the invisible middle and back of the car in its digital brain, allowing it to count the whole car, not just the bumper.

2. The "Visual Equivalence" Check (The Double-Check)

To make sure it's not just hallucinating (making things up), CountOCC uses a "Teacher-Student" system.

The Teacher looks at a clear, unblocked photo and learns what the attention map (where the AI looks) should look like.
The Student looks at the blocked photo and tries to make its "attention map" look exactly like the Teacher's.

If the Student tries to count only the visible parts, its attention map will look different from the Teacher's, and the system corrects it. This forces the AI to realize, "Hey, even though I can't see the back of the car, my focus should still be on the entire car, just like the Teacher's."

The New Training Grounds

To prove this works, the researchers didn't just test on normal photos. They created new, harder tests (FSC-147-OCC and CARPK-OCC).

They took thousands of photos of cars, people, and objects.
They digitally painted black boxes over them to simulate heavy blocking.
They asked the AI to count the total number of objects, hidden and visible.

The Results: A Giant Leap
When they ran the tests, CountOCC was a superstar.

Old AI: Counted only the visible cars. If 5 cars were hidden, it missed 5.
CountOCC: Counted the visible cars plus the hidden ones.
The Score: On the parking-lot benchmark (CARPK-OCC), it cut counting error by nearly 50% compared to CountGD, the earlier system it builds on, even though it had never been trained on that dataset. On the other benchmarks the reductions were smaller (roughly 17% to 29%) but consistent, which suggests it learned the general idea of counting hidden things rather than memorizing answers.

Why This Matters

This isn't just about counting cars. Imagine:

Farmers: Counting crops hidden behind tall weeds to know how much food they will harvest.
Factories: Counting items on a conveyor belt even when boxes block the view.
Crowd Safety: Estimating how many people are in a dense crowd, even if they are packed so tight you can only see heads.

In a Nutshell:
Previous AI was like a person who only counts what they can see with their eyes open. CountOCC is like a person who can close their eyes, use their memory and logic, and still give a good estimate of how many people are in the room, even if half of them are hiding behind a wall. It teaches computers to "see" the invisible.

1. Problem Statement

The Challenge of Occlusion in Open-World Counting:
Current state-of-the-art (SOTA) open-world object counting methods (e.g., CountGD, LOCA, CounTR) excel at counting fully visible objects using visual exemplars or text prompts. However, their accuracy drops sharply when objects are occluded.

Architectural Limitation: Existing models rely on backbone networks that encode the visible pixels. When an object is occluded, the backbone encodes the occluding surface (background or foreground clutter) rather than the target object. This corrupts the feature representation, making it impossible for the model to infer the existence or count of hidden instances.
The Gap: While humans can infer the existence of occluded objects (amodal perception), current AI models treat occluded regions as background, leading to severe undercounting in cluttered real-world scenarios (e.g., parking lots, retail shelves, crowds).

2. Methodology: CountOCC

The authors propose CountOCC, the first open-world amodal counting framework that explicitly reconstructs occluded object features. The architecture extends the CountGD baseline with two core mechanisms:

A. Feature Reconstruction Module (FRM)

The FRM operates in the feature space to recover discriminative representations for occluded regions across multiple hierarchical pyramid levels.

Visible-Occluded Separation: The model decomposes backbone features into visible tokens ( $Z_{vis}$ ) and occluded positions. Occluded positions are initialized with learnable query tokens ( $Q_0$ ) derived from a trainable mask embedding.
Hierarchical Attention Fusion:
1. Self-Attention: The occluded queries model interdependencies among masked positions.
2. Cross-Attention (Spatial): Queries attend to visible tokens ( $Z_{vis}$ ) to aggregate spatial context from unoccluded areas.
3. Cross-Attention (Semantic): The spatially informed queries are modulated by fused text-visual embeddings ( $Z_{v,t}$ ) to inject class-specific semantic guidance.
Reconstruction: A Multi-Layer Perceptron (MLP) transforms these conditioned queries into reconstructed features ( $\hat{Z}_{occ}$ ), which replace the corrupted features in the occluded regions.
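
The steps above can be sketched for a single pyramid level as follows. This is a minimal NumPy illustration, not the authors' implementation: the single-head attention without learned projections, the residual connections, the positional encodings `pos`, the ReLU MLP, and all function and parameter names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values

def reconstruct_occluded(features, occluded, mask_embedding, pos, z_vt, w1, w2):
    """Sketch of the FRM for one pyramid level.

    features:       (N, d) backbone tokens
    occluded:       (N,) boolean, True where a token is covered
                    (assumes at least one visible token)
    mask_embedding: (d,) trainable embedding that initializes Q_0
    pos:            (N, d) positional encodings (assumption, not from the text)
    z_vt:           (M, d) fused text-visual embeddings Z_{v,t}
    w1, w2:         MLP weights with shapes (d, h) and (h, d)
    """
    z_vis = features[~occluded]                                 # visible tokens Z_vis
    n_occ = int(occluded.sum())
    q = np.tile(mask_embedding, (n_occ, 1)) + pos[occluded]     # Q_0
    q = q + attention(q, q, q)            # 1. self-attention among masked positions
    q = q + attention(q, z_vis, z_vis)    # 2. spatial cross-attention to Z_vis
    q = q + attention(q, z_vt, z_vt)      # 3. semantic cross-attention to Z_{v,t}
    z_occ_hat = np.maximum(q @ w1, 0.0) @ w2                    # MLP -> reconstructed features
    out = features.copy()
    out[occluded] = z_occ_hat             # replace only the corrupted positions
    return out
```

In this sketch the visible tokens pass through unchanged and only the occluded positions are overwritten, which mirrors the "replace the corrupted features" step described above.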

B. Visual Equivalence (VisEQ) Supervision

To ensure the reconstructed features are semantically consistent with real objects, the authors introduce a teacher-student distillation framework operating in the attention space.

Teacher-Student Setup:
- Teacher: Processes the original, unoccluded image to generate a "ground truth" attention map ( $G_T$ ).
- Student: Processes the occluded image (with reconstructed features) to generate an attention map ( $G_S$ ).
Gradient-Based Attention Alignment: Using a Language-Conditioned GradCAM approach, the model computes attention maps based on the matching score between the decoder output and the text/exemplar prompts.
Loss Functions:
- Attention Similarity Loss ( $L_{sim}$ ): Penalizes the pixel-wise $\ell_2$ distance and the cosine dissimilarity between $G_T$ and $G_S$ , ensuring the student focuses on the same object evidence as the teacher despite occlusion.
- Region of Interest (RoI) Consistency Loss ( $L_{cst}$ ): Prevents trivial solutions (e.g., predicting zero everywhere) by enforcing high mean activation and low variance in confident regions.
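
As a rough illustration of the two losses, the sketch below takes the teacher and student attention maps as already-computed arrays with values in [0, 1]. The equal weighting of terms, the `threshold` used to define "confident regions" from the teacher map, and the function names are assumptions for this example; the summary does not give the exact formulas.

```python
import numpy as np

def attention_similarity_loss(g_t, g_s, eps=1e-8):
    """L_sim sketch: mean squared difference plus (1 - cosine similarity)."""
    t, s = g_t.ravel(), g_s.ravel()
    l2 = np.mean((t - s) ** 2)
    cos = np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s) + eps)
    return l2 + (1.0 - cos)

def roi_consistency_loss(g_t, g_s, threshold=0.5):
    """L_cst sketch: inside regions where the teacher is confident
    (assumed here to mean g_t >= threshold), push the student toward
    high mean activation and low variance."""
    roi = g_t >= threshold
    if not roi.any():
        return 0.0
    vals = g_s[roi]
    return (1.0 - vals.mean()) + vals.var()

# Toy 2x2 maps: the teacher attends to two cells.
g_teacher = np.array([[0.0, 1.0], [1.0, 0.0]])
g_collapsed = np.zeros_like(g_teacher)  # trivial "predict zero everywhere" student

print(attention_similarity_loss(g_teacher, g_teacher))   # close to 0
print(roi_consistency_loss(g_teacher, g_collapsed))      # penalizes the collapse
```

The second call shows why a consistency term is useful: an all-zero student map is penalized directly in the region where the teacher is confident.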

C. Training Strategy

Two-Stage Curriculum:
1. Stage 1: Train only the FRM using reconstruction losses ( $\ell_2$ , cosine, Charbonnier) on synthetically occluded FSC-147 data.
2. Stage 2: Jointly train FRM and VisEQ components, refining the alignment between visible and occluded views.
Data Augmentation: On-the-fly object-aware occlusion is applied during training, where rectangular masks are anchored to ground-truth objects to simulate realistic partial/full occlusion.
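
The two training ingredients above, object-anchored occlusion and the Stage 1 reconstruction loss, might look roughly like this. It is a sketch under stated assumptions: ground-truth objects are given as pixel boxes, occluders are zero-filled and cover the left part of each chosen box, the three loss terms are weighted equally, and every name and default value here is hypothetical.

```python
import numpy as np

def object_aware_occlusion(image, boxes, rng, coverage=(0.3, 1.0), max_objects=3):
    """Return an occluded copy of `image` and the boolean occlusion mask.

    image: (H, W, C) array; boxes: list of (x0, y0, x1, y1) in pixels.
    Each chosen box is covered from its left edge by a random fraction
    of its width; a fraction of 1.0 hides the object completely.
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    if len(boxes) == 0:
        return image.copy(), mask
    k = min(max_objects, len(boxes))
    for i in rng.choice(len(boxes), size=k, replace=False):
        x0, y0, x1, y1 = boxes[i]
        frac = rng.uniform(*coverage)
        x_end = x0 + max(1, int(round(frac * (x1 - x0))))
        mask[y0:y1, x0:x_end] = True
    occluded = image.copy()
    occluded[mask] = 0  # zero fill is an assumption about the occluder appearance
    return occluded, mask

def reconstruction_loss(z_hat, z_true, eps=1e-3):
    """Stage 1 sketch: l2 + cosine + Charbonnier terms, equally weighted."""
    diff = z_hat - z_true
    l2 = np.mean(diff ** 2)
    num = np.sum(z_hat * z_true, axis=-1)
    den = np.linalg.norm(z_hat, axis=-1) * np.linalg.norm(z_true, axis=-1) + 1e-8
    cosine = np.mean(1.0 - num / den)
    charbonnier = np.mean(np.sqrt(diff ** 2 + eps ** 2))  # smooth l1-like penalty
    return l2 + cosine + charbonnier
```

Because the mask is generated on the fly from the annotations, the unoccluded image remains available to supply the target features for the reconstruction loss.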

3. Key Contributions

First Open-World Amodal Framework: CountOCC is the first method to explicitly reconstruct and reason about occluded object instances in an open-world setting (arbitrary categories specified by text/exemplars).
Novel Architectural Components:
- Feature Reconstruction Module (FRM): Recovers class-discriminative features for hidden regions using hierarchical spatial-semantic attention.
- Visual Equivalence (VisEQ): Enforces attention consistency between occluded and unoccluded views via teacher-student distillation.
New Benchmarks: The authors established rigorous evaluation protocols by creating occlusion-augmented versions of standard datasets:
- FSC-147-OCC: Occlusion-augmented FSC-147.
- CARPK-OCC: Occlusion-augmented CARPK (parking lot dataset).
- These benchmarks preserve original annotations while introducing controlled occlusion patterns.
Comprehensive Evaluation: The framework was evaluated on FSC-147-OCC, CARPK-OCC, and the existing CAPTURe-Real dataset, demonstrating robustness across structured and unstructured scenes.

4. Experimental Results

CountOCC achieved SOTA performance across all benchmarks, outperforming prior baselines (CountGD, LOCA, CounTR, etc.) by the margins below.

FSC-147-OCC:
- Validation: 26.72% reduction in Mean Absolute Error (MAE) compared to CountGD.
- Test: 20.80% reduction in MAE.
- RMSE: Reductions of 34.90% (val) and 54.71% (test).
CARPK-OCC (Zero-Shot Generalization):
- Achieved a 49.89% reduction in MAE and 47.56% reduction in RMSE compared to CountGD.
- Demonstrated strong generalization to an unseen domain (parking-lot scenes) without fine-tuning.
CAPTURe-Real:
- Achieved a 28.79% reduction in MAE, validating performance on pattern-based occlusion.
Real-World Application: Tested on the CrowdHuman dataset (natural inter-person occlusion), showing a 17.35% MAE reduction over CountGD.
Ablation Studies: Confirmed that both multi-level feature reconstruction and VisEQ supervision are critical; removing either leads to significant performance degradation.
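
For readers unfamiliar with how such percentages are obtained, the sketch below computes MAE, RMSE, and the relative error reduction against a baseline using the standard definitions. The counts are made-up toy numbers chosen for easy arithmetic; they are not results from the paper.

```python
import math

def mae(pred, true):
    """Mean Absolute Error over per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root Mean Squared Error over per-image counts."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def relative_reduction(baseline_error, new_error):
    """Percentage by which the new error is lower than the baseline error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Toy data (hypothetical): true counts and two sets of predictions.
true_counts = [10, 20, 30, 40]
baseline_pred = [6, 14, 22, 30]   # undercounts heavily, as under occlusion
improved_pred = [7, 16, 24, 32]   # undercounts less

mae_reduction = relative_reduction(
    mae(baseline_pred, true_counts), mae(improved_pred, true_counts)
)
rmse_reduction = relative_reduction(
    rmse(baseline_pred, true_counts), rmse(improved_pred, true_counts)
)
print(f"MAE reduction: {mae_reduction:.2f}%, RMSE reduction: {rmse_reduction:.2f}%")
```

Note that MAE and RMSE reductions generally differ on the same predictions, since RMSE weights large per-image errors more heavily.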

5. Significance and Impact

Paradigm Shift: Moves open-world counting from "counting what is visible" to "counting what exists," bridging the gap between computer vision and human amodal perception.
Practical Utility: Directly addresses critical needs in inventory management (retail), traffic monitoring (parking), and agricultural yield estimation where occlusion is common.
Robustness: Provides evidence, through the ablation studies, that explicit feature reconstruction is needed to handle the "corruption" that occluding surfaces introduce into backbone features.
Future Directions: While the model excels at counting totals, the authors note a limitation in precise spatial localization of hidden instances (the density map is accurate, but exact coordinates of occluded objects may vary). Future work could integrate precise amodal detection.

In summary, CountOCC addresses a fundamental architectural limitation of current counting models by synthesizing complete object representations from partial visual cues and semantic priors, setting a new standard for robust object counting in complex, occluded environments.
