RepSFNet: A Single Fusion Network with Structural Reparameterization for Crowd Counting

RepSFNet is a lightweight, real-time crowd counting architecture. It pairs a RepLK-ViT backbone that uses structural reparameterization with a specialized fusion module, handling scale variations and occlusions without complex attention mechanisms, which keeps accuracy high and latency low.

Mas Nurul Achmadiah, Chi-Chia Sun, Wen-Kai Kuo, Jun-Wei Hsieh

Published 2026-02-24

Imagine you are standing on a balcony looking down at a massive, chaotic concert crowd. Your job is to count every single person. Some people are packed tight like sardines, others are spread out, some are hidden behind others, and the lighting keeps changing. Doing this by eye is practically impossible, and doing it accurately with a computer has usually meant slow models that need powerful hardware.

This paper introduces RepSFNet, a new "smart camera brain" designed to count crowds quickly and accurately, even on small, low-power devices like a smartphone or a drone.

Here is the breakdown of how it works, using simple analogies:

1. The Problem: The "Blurry Binoculars" Issue

Existing crowd-counting systems are like people trying to count a crowd through binoculars that are either too weak (they miss people in the distance) or too heavy (they take forever to focus).

  • Old methods often use complex "attention" mechanisms. Think of this as a security guard who stops to stare at every single person individually to make sure they aren't hiding. It's accurate, but it's slow and exhausting.
  • The Goal: We need a system that is fast, light, and sees the whole picture at once without getting tired.

2. The Solution: RepSFNet (The "Super-Scanner")

The authors built a new architecture called RepSFNet. Think of it as a high-speed scanner that uses a special trick called Structural Reparameterization.

A. The "Magic Lens" (RepLK-ViT Backbone)

Imagine you have a camera lens. Usually, to see things far away, you need a huge, heavy lens.

  • The Trick: RepSFNet uses a "magic lens" (Large Kernel Convolution). During the training phase, it uses a complex, multi-part lens to learn how to see everything. But right before it starts working in the real world, it performs a "magic trick" (reparameterization): it folds those parts into a single, streamlined lens that produces the same output as the complex one but is much faster to use.
  • The Result: It gets the wide view of a giant telescope but the speed of a pocket camera.
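
The folding step can be illustrated with a minimal NumPy sketch. It assumes one common form of structural reparameterization, in which a small-kernel branch trained in parallel with a large-kernel branch is zero-padded and added into the large kernel before deployment; whether RepSFNet uses exactly this scheme is not confirmed by this summary. The helper names `conv2d_same` and `merge_kernels`, and the 7x7 and 3x3 kernel sizes, are hypothetical choices for illustration.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive zero-padded 'same' 2D cross-correlation for an odd-sized square kernel."""
    size = k.shape[0]
    xp = np.pad(x, size // 2)  # zero padding on every side
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + size, j:j + size] * k)
    return out

def merge_kernels(k_large, k_small):
    """Fold a small kernel into a large one by zero-padding it to the same size."""
    pad = (k_large.shape[0] - k_small.shape[0]) // 2
    return k_large + np.pad(k_small, pad)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))      # stand-in for one feature-map channel
k_large = rng.normal(size=(7, 7))  # hypothetical large-kernel branch
k_small = rng.normal(size=(3, 3))  # hypothetical parallel small-kernel branch

# Training-time view: two branches whose outputs are summed.
two_branch = conv2d_same(x, k_large) + conv2d_same(x, k_small)

# Inference-time view: one merged kernel, one convolution.
merged_kernel = merge_kernels(k_large, k_small)
one_branch = conv2d_same(x, merged_kernel)

max_diff = float(np.max(np.abs(two_branch - one_branch)))
print(max_diff)
```

Because convolution is linear, `max_diff` stays at floating-point rounding level: the single merged kernel reproduces the two-branch output while needing only one pass over the input.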

B. The "Swiss Army Knife" (Feature Fusion)

Once the camera sees the crowd, it needs to understand the details.

  • ASPP (Atrous Spatial Pyramid Pooling, the Zoom Lens): This part looks at the crowd from different "zoom levels" simultaneously. It sees the whole stadium, the section, and the individual groups all at once.
  • CAN (Context-Aware Network, the Spotlight): This part acts like a smart spotlight. If a section of the crowd is very dense, the spotlight gets brighter there. If it's sparse, it dims. It adapts to the crowd's density automatically.
  • The Fusion: RepSFNet combines these two views into one clear picture, ensuring it doesn't miss anyone in the back or double-count people in the front.
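
The "zoom levels" idea can be sketched with dilated (atrous) convolutions, the building block of ASPP. This is a simplified illustration, not the paper's fusion module: the dilation rates `(1, 2, 4)`, the all-ones kernels, the averaging used as the fusion step, and the function names `dilated_conv2d` and `aspp_like` are all assumptions, and the CAN weighting is omitted.

```python
import numpy as np

def dilated_conv2d(x, k, rate):
    """Zero-padded 'same' cross-correlation with a 3x3 kernel sampled every `rate` pixels."""
    h, w = x.shape
    xp = np.pad(x, rate)  # a dilated 3x3 kernel reaches `rate` pixels from its center
    out = np.zeros_like(x, dtype=float)
    for a in range(3):
        for b in range(3):
            out += k[a, b] * xp[a * rate:a * rate + h, b * rate:b * rate + w]
    return out

def aspp_like(x, kernels, rates):
    """Run one dilated branch per rate, then fuse the branches by averaging."""
    branches = np.stack([dilated_conv2d(x, k, r) for k, r in zip(kernels, rates)])
    return branches, branches.mean(axis=0)

# A single bright pixel makes each branch's field of view visible.
x = np.zeros((15, 15))
x[7, 7] = 1.0

rates = (1, 2, 4)                           # hypothetical dilation rates
kernels = [np.ones((3, 3)) for _ in rates]  # hypothetical (untrained) kernels
branches, fused = aspp_like(x, kernels, rates)

# Height in pixels of the region each branch responds over.
extents = []
for branch in branches:
    rows = np.nonzero(branch)[0]
    extents.append(int(rows.max() - rows.min() + 1))
print(extents)
```

Each branch uses the same nine weights, but a larger rate spreads them over a wider area, so the fused map mixes fine and coarse context at no extra parameter cost per branch.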

C. The "No-Clutter" Design (Single Fusion)

Many old systems are like a kitchen with too many open doors and hallways (multi-branch designs). This slows down the traffic.

  • RepSFNet is a straight hallway. It takes the input, processes it efficiently, and spits out the answer. By removing the "hallways" and "attention guards," it saves a massive amount of energy and time.

3. The "Teacher" (The Loss Function)

How does the computer learn?

  • The Count (MAE): The teacher checks the total number using mean absolute error (MAE), the average gap between the predicted and true counts. "Did you count 100 people? Good."
  • The Map (Optimal Transport): This is the clever part. The teacher doesn't just care about the total number; they care about where the people are. If the computer says "100 people" but puts them all in the top-left corner, the teacher says, "Wrong! The people are actually spread out."
  • The Result: The computer learns to draw a perfect "heat map" of the crowd, not just guess a number.
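
The difference between the two checks can be made concrete with a small sketch. It assumes density maps whose pixel values sum to the head count, and it approximates the optimal-transport cost with Sinkhorn iterations, a standard technique swapped in here for illustration; the paper's actual solver, cost function, and loss weighting are not given in this summary. The names `count_error` and `sinkhorn_cost` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np

def count_error(pred, target):
    """Absolute difference between predicted and true counts (sums of the maps)."""
    return abs(float(pred.sum()) - float(target.sum()))

def sinkhorn_cost(pred, target, eps=0.05, n_iters=200):
    """Approximate transport cost between two density maps normalized to unit mass."""
    h, w = pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1) / max(h - 1, w - 1, 1)
    cost = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)  # squared distance
    a = pred.ravel() / pred.sum()
    b = target.ravel() / target.sum()
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v + 1e-300)
        v = b / (kernel.T @ u + 1e-300)
    plan = u[:, None] * kernel * v[None, :]  # how much mass moves between each pixel pair
    return float((plan * cost).sum())

# Ground truth: four people, one in each corner of a 4x4 grid.
target = np.zeros((4, 4))
target[0, 0] = target[0, 3] = target[3, 0] = target[3, 3] = 1.0

pred_good = target.copy()  # right count, right places
pred_bad = np.zeros((4, 4))
pred_bad[0, 0] = 4.0       # right count, everyone in the top-left corner

print(count_error(pred_good, target), count_error(pred_bad, target))
cost_good = sinkhorn_cost(pred_good, target)
cost_bad = sinkhorn_cost(pred_bad, target)
print(cost_good, cost_bad)
```

Both predictions have zero count error, yet the cornered prediction incurs a far larger transport cost, which is the signal that pushes the model toward putting people in the right places.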

4. The Results: Fast and Lean

The authors tested this new system against the best existing ones (like P2PNet and STEERER) on famous crowd datasets.

  • Accuracy: It is just as good as the heavy, slow systems. In some tests (like the NWPU dataset), it was actually the best.
  • Speed: This is the big win. Because it removed the "heavy lifting" parts, it is 34% faster than its competitors.
  • Real-World Use: It's so efficient that it can run on low-power devices (Edge Computing). Imagine a drone flying over a protest or a festival, counting people in real-time without needing a massive server farm to process the data.

Summary

RepSFNet is like upgrading from a heavy, slow-moving tank to a sleek, high-speed sports car. It uses a "magic lens" trick to see everything clearly, combines different views to understand the crowd's density, and strips away all the unnecessary weight to run incredibly fast. It proves you don't need a supercomputer to count a crowd; you just need a smart, efficient design.
