Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms

Imagine you are sitting on a train as it slowly pulls into a busy station. The platform is packed with people. Your job is to count exactly how many people are waiting there, in real-time, without missing anyone or counting the same person twice.

This sounds easy, but doing it with a camera on a moving train is a nightmare for computers. Here is why:

The Train is Moving: As the train approaches, the people on the platform look like they are zooming toward the camera, even if they are standing still. It's like looking out a car window; the trees seem to rush past, but they aren't moving.
The Crowd is Dense: People are shoulder-to-shoulder. Their bodies block each other, making it hard to see where one person ends and another begins.
The Perspective is Weird: People far away look tiny, and people close up look huge. A standard computer program gets confused by this rapid size change.

The paper "Phys-3D" proposes a clever solution to this problem. Instead of just using a "smart camera," the authors built a system that understands physics and geometry.

Here is how their system works, broken down into simple parts:

1. The Detective: "Head Hunting"

Most security cameras try to spot a whole person (head to toe). But on a crowded platform, legs and torsos get hidden behind other people.

The Analogy: Imagine trying to spot a flock of birds in a tree. If you look for the whole bird, the branches hide them. But if you just look for the heads, they are usually visible above the branches.
The Solution: The system ignores bodies and only looks for heads. It uses a super-smart AI (a detector called YOLOv11m) trained specifically to find heads even when they are squished together or blurry.

2. The Tracker: The "Physics Coach"

Once the camera spots a head, it needs to follow that person as the train moves.

The Problem: Standard tracking software assumes the camera is standing still. When the train moves, the software gets confused. It thinks the people are running toward the train because they are getting bigger in the picture. It loses track of them or swaps their identities (thinking Person A is now Person B).
The Solution (Phys-3D): The authors created a new tracker called Phys-3D. Think of this tracker as a physics coach who knows how trains work.
- Instead of just watching the 2D picture on the screen, the coach imagines the people in 3D space.
- It knows: "The train is slowing down. The people are standing still. Therefore, if they look like they are zooming toward us, it's because we are moving, not them."
- By applying the laws of physics (like how a pinhole camera works), it separates the motion of the train from the motion of the people. This keeps the "ID tag" on each person stable, even when they are hidden for a second.

3. The Counter: The "Virtual Hallway"

Even with a good tracker, counting is tricky. If a person is blocked by a pole for a split second, a simple counter might think they left and then re-entered, counting them twice.

The Analogy: Imagine a bouncer at a club with a click-counter. If he clicks every time someone crosses the doorway, a guest who steps out and right back in gets counted twice. But if he only clicks once a guest has stayed in a small hallway just inside the door (a "virtual counting band") for a moment, that back-and-forth shuffling no longer inflates the count.
The Solution: The system creates invisible "zones" on the platform. A person is only counted if they stay in that zone for a few frames (moments). This filters out the "jitter" and ensures that if someone is briefly hidden, they aren't counted twice.

The Results: Why Does This Matter?

The team tested this on a real dataset of train platforms (which they created because no good one existed).

Accuracy: Their system's counts were off by only 2.97% on average (the mean absolute percentage error). On a platform with 100 people, that is a typical miss of about three.
Speed: It runs in real-time, meaning it can be used on the train while it is arriving.

The Big Picture

This isn't just about counting heads. It's about safety and efficiency.

Safety: If the platform is too crowded, the train can be delayed to let people off first, preventing accidents.
Efficiency: Station managers can see exactly how many people are waiting and decide if they need to send a bigger train or more staff.

In summary: The paper teaches a computer how to be a smart observer on a moving train. By focusing on heads, understanding the physics of the train's movement, and using a "waiting zone" to count, it solves a problem that has been too messy for computers to handle until now. It turns a chaotic, blurry video into a precise, reliable number.

1. Problem Statement

Accurate, real-time crowd counting on railway platforms is critical for safety management and train scheduling. However, existing solutions face significant challenges when deployed on moving trains (onboard cameras) rather than static surveillance systems:

Dynamic Conditions: Severe perspective distortion, rapid scale changes (heads grow as the train approaches), and motion blur.
Occlusion: Dense crowds cause extensive mutual occlusion, making full-body detection unreliable.
Ego-Motion Confusion: Standard tracking algorithms (e.g., Constant Velocity Kalman filters) misinterpret camera-induced apparent motion as target motion, leading to trajectory drift, identity switches, and inaccurate counts.
Counting Brittleness: Naïve counting methods (e.g., crossing a single line) fail under detection jitter and temporary occlusions, causing duplicate or missed counts.

2. Methodology

The authors propose Phys-3D, a real-time, end-to-end Detect-Track-Count pipeline designed specifically for train-mounted cameras. The system consists of three core stages:

A. Head-Based Detection & Encoding

Strategy: Instead of full-body detection, the system focuses on head detection, which remains visible and stable even in dense crowds.
Detector: Uses YOLOv11m. It employs a two-stage transfer learning approach:
1. Pre-training on the large-scale CrowdHuman dataset.
2. Fine-tuning on a custom, domain-specific dataset (RailwayPlatformCrowdHead).
Re-Identification (ReID): Uses EfficientNet-B0 to extract 128-dimensional appearance embeddings. This allows the system to re-identify pedestrians after temporary occlusions.
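
To illustrate how appearance embeddings can support re-identification after an occlusion, here is a minimal sketch in plain Python. The cosine-distance metric, the `max_distance` threshold, and the helper names `cosine_distance` and `best_match` are assumptions made for illustration; this summary does not state which similarity measure the authors use. Short 3-dimensional vectors stand in for the 128-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def best_match(query, gallery, max_distance=0.3):
    """Return the id of the closest stored embedding, or None if none is close.

    gallery maps a track id to the embedding stored for that track.
    max_distance is a hypothetical gating threshold, not a value from the paper.
    """
    best_id, best_dist = None, max_distance
    for track_id, embedding in gallery.items():
        dist = cosine_distance(query, embedding)
        if dist < best_dist:
            best_id, best_dist = track_id, dist
    return best_id


# Toy gallery: two tracks that were lost behind an occluder.
gallery = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}
reappeared = [0.9, 0.1, 0.0]
matched_id = best_match(reappeared, gallery)
```

In this toy case the reappearing head is assigned back to track 1, because its embedding points in nearly the same direction as the stored one. An embedding that is far from every stored track returns `None`, which a tracker would treat as a new identity.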

B. Phys-3D: Physics-Constrained 3D Tracking

This is the core innovation. Unlike standard DeepSORT, which models motion in 2D image space, Phys-3D models motion in 3D space using a pinhole camera model.

State Vector: The Kalman filter state is defined as $x_{Phys3D} = [X, Y, H, Z, \dot{Z}, \ddot{Z}]^T$, representing 3D position ($X, Y, Z$), head height ($H$), and the velocity and acceleration of depth ($\dot{Z}, \ddot{Z}$).
Geometric Constraints:
- Assumes the train moves along a straight track, so the lateral coordinates ($X, Y$) and head height ($H$) stay roughly constant for a pedestrian standing on the platform.
- The primary change is in depth ($Z$), driven by the train's deceleration.
- It relates the 2D bounding box height ($h$) to 3D distance ($Z$) via the pinhole equation $Z(t) = f_y \cdot H / h(t)$, where $f_y$ is the camera's vertical focal length in pixels.
Benefit: By decoupling true pedestrian motion from camera ego-motion, the tracker maintains physically plausible trajectories, significantly reducing identity switches and drift during train deceleration.
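
The geometry above can be sketched in a few lines of Python. This is not the authors' implementation: the constant-acceleration update on depth is an assumed transition model consistent with the listed state vector, only the state mean is propagated (a real Kalman filter would also propagate a covariance), and the function names and numeric values are hypothetical.

```python
def depth_from_box_height(f_y, head_height_m, box_height_px):
    """Pinhole relation Z = f_y * H / h (f_y in pixels, H in metres)."""
    if box_height_px <= 0:
        raise ValueError("box height must be positive")
    return f_y * head_height_m / box_height_px


def box_height_from_depth(f_y, head_height_m, depth_m):
    """Inverse pinhole relation h = f_y * H / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return f_y * head_height_m / depth_m


def predict_state(state, dt):
    """Predict [X, Y, H, Z, Z_dot, Z_ddot] one step ahead.

    Assumption: X, Y and H stay fixed (pedestrian standing still, straight
    track) and depth follows constant-acceleration kinematics.
    """
    x, y, h, z, z_dot, z_ddot = state
    z_next = z + z_dot * dt + 0.5 * z_ddot * dt * dt
    z_dot_next = z_dot + z_ddot * dt
    return [x, y, h, z_next, z_dot_next, z_ddot]


# Hypothetical numbers: 1000 px focal length, 0.25 m head, 25 px box.
f_y, head_h = 1000.0, 0.25
z0 = depth_from_box_height(f_y, head_h, 25.0)    # 10.0 m
state = [1.0, 0.5, head_h, z0, -2.0, 0.5]        # closing at 2 m/s, braking
next_state = predict_state(state, dt=1.0)
next_box = box_height_from_depth(f_y, head_h, next_state[3])
```

With these numbers the predicted depth drops from 10 m to 8.25 m, so the predicted box grows from 25 px to about 30 px even though the pedestrian has not moved. That is the apparent motion a 2D image-space model would wrongly attribute to the target.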

C. Virtual Counting Band

To convert trajectories into reliable counts, the system avoids simple line-crossing methods.

Mechanism: Defines a "Virtual Counting Band" (a region near the platform edge) with specific start/end boundaries relative to image width.
Persistence: A target is only counted if it remains continuously within the band for a set number of frames ($N$).
Robustness: This temporal persistence window smooths out detection jitter and brief occlusions, preventing double-counting or missed entries.
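
A counting band with temporal persistence can be sketched as follows. The class name `BandCounter`, its parameters, and the rule that a track missing from a frame has its streak reset are assumptions for illustration; the summary specifies only that the band is bounded relative to image width and that a target must stay inside for $N$ consecutive frames.

```python
class BandCounter:
    """Count each track id once after it stays in the band for min_frames."""

    def __init__(self, band_start, band_end, image_width, min_frames):
        # band_start and band_end are fractions of the image width (0..1).
        self.x_min = band_start * image_width
        self.x_max = band_end * image_width
        self.min_frames = min_frames
        self.streak = {}      # track id -> consecutive frames inside band
        self.counted = set()  # track ids that have already been counted

    def update(self, tracks):
        """Process one frame; tracks maps track id -> box centre x in pixels."""
        # Assumption: a track absent from this frame breaks its streak.
        for track_id in self.streak:
            if track_id not in tracks:
                self.streak[track_id] = 0
        for track_id, x in tracks.items():
            if self.x_min <= x <= self.x_max:
                self.streak[track_id] = self.streak.get(track_id, 0) + 1
            else:
                self.streak[track_id] = 0
            if self.streak[track_id] >= self.min_frames:
                self.counted.add(track_id)  # a set ignores repeat entries
        return len(self.counted)


counter = BandCounter(band_start=0.4, band_end=0.6, image_width=1000,
                      min_frames=3)
frames = [
    {1: 450, 2: 100},
    {1: 500, 2: 450},   # track 2 enters the band for a single frame
    {1: 550, 2: 700},   # track 1 reaches 3 consecutive frames
    {1: 900},           # track 1 leaves the band
    {1: 500}, {1: 500}, {1: 500},  # track 1 re-enters and persists again
]
totals = [counter.update(frame) for frame in frames]
```

Track 2 brushes the band for a single frame and is never counted, while track 1 is counted exactly once even though it leaves and re-enters. A single-line crossing rule would have registered both events.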

3. Key Contributions

Phys-3D Tracker: A novel Kalman filter variant that integrates first-principles geometry and ego-motion priors into the state prediction, solving the instability caused by moving cameras and perspective distortion.
Domain-Specific Dataset (MOT-RPCH): The release of the RailwayPlatformCrowdHead dataset, containing 27 video sequences, ~89k bounding boxes, and 885 unique identities, specifically annotated for head-based detection from a train viewpoint.
End-to-End Pipeline: A unified system combining head detection, appearance-based ReID, and physics-constrained tracking, achieving real-time performance on edge hardware.
Robust Counting Strategy: The introduction of a virtual counting band with temporal persistence to handle occlusions and jitter in dynamic environments.

4. Experimental Results

The system was evaluated on the MOT-RailwayPlatformCrowdHead (MOT-RPCH) benchmark.

Tracking Performance:
- MOTA (Multiple Object Tracking Accuracy): 67.19%
- IDF1 (Identity F1 Score): 76.32%
- Identity Switches (IDSW): Reduced to an average of 24.5 per sequence.
- Precision: 89.06%
Counting Performance:
- MAPE (Mean Absolute Percentage Error): 2.97% (significantly lower than baselines).
- MAE (Mean Absolute Error): 0.9
- RMSE: 1.36
Comparative Analysis:
- Phys-3D was compared against standard Constant Velocity (CV-8D) and Constant Acceleration (CA-12D) motion models.
- CV-8D had a MAPE of 14.59%, attributed to its failure to model deceleration and perspective.
- CA-12D had a MAPE of 6.99%, an improvement that remained sensitive to noise.
- Phys-3D's MAPE of 2.97% was the lowest of the three, suggesting that physics-based constraints help more than simply increasing kinematic model complexity.
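
For readers unfamiliar with the counting metrics reported above, the sketch below shows how MAE, RMSE, and MAPE are conventionally computed from paired true and predicted counts. The sample counts are invented for illustration and are not taken from the paper, and averaging per sequence is an assumption about how the authors aggregate.

```python
import math


def counting_errors(true_counts, predicted_counts):
    """Return (MAE, RMSE, MAPE in percent) over paired counts."""
    if not true_counts or len(true_counts) != len(predicted_counts):
        raise ValueError("need two non-empty lists of equal length")
    if any(t == 0 for t in true_counts):
        raise ValueError("MAPE is undefined when a true count is zero")
    n = len(true_counts)
    abs_errors = [abs(p - t) for t, p in zip(true_counts, predicted_counts)]
    mae = sum(abs_errors) / n
    rmse = math.sqrt(sum(e * e for e in abs_errors) / n)
    mape = 100.0 * sum(e / t for e, t in zip(abs_errors, true_counts)) / n
    return mae, rmse, mape


# Made-up example: three sequences with 20, 40 and 50 people.
mae, rmse, mape = counting_errors([20, 40, 50], [21, 40, 48])
```

Here the absolute errors are 1, 0, and 2 people, which gives an MAE of 1.0 and a MAPE of about 3%. MAPE normalizes each error by the true count, so a miss of two people weighs less on a crowded platform than on a sparse one.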

5. Significance

Safety & Operations: Provides railway operators with reliable, real-time passenger density data, enabling adaptive dispatching, proactive safety management, and better platform crowd control.
Technical Advancement: Demonstrates that incorporating physical priors (geometry, ego-motion) into deep learning pipelines is more effective than relying solely on data-driven models for dynamic, non-stationary camera scenarios.
Deployment Viability: The system is optimized for real-time inference on edge hardware (e.g., NVIDIA T4 GPU), making it feasible for immediate deployment on trains without requiring heavy cloud processing.
Future Impact: Sets a new standard for transportation vision tasks, moving beyond static surveillance to handle the complexities of mobile, onboard perception.