Depth-Enhanced YOLO-SAM2 Detection for Reliable Ballast Insufficiency Identification

Imagine a railway track as a giant, heavy-duty bed. The rails are the mattress, and the ballast (the crushed rocks underneath) is the pillow that supports it. If that pillow gets squished, loses chunks, or sinks too low, the bed becomes unstable. Trains can't run safely on a broken bed; they might derail or be damaged.

For a long time, checking this "pillow" meant sending a human inspector to walk the tracks, squinting at the rocks, and guessing if there was enough. It's dangerous, tiring, and everyone guesses a little differently.

This paper introduces a robotic inspector that uses a special "3D vision" system to check the ballast automatically. Here is how it works, broken down into simple steps:

1. The Problem: The "Flat" Camera Lie

The researchers first tried using a standard camera (like the one on your phone) to look at the rocks.

The Analogy: Imagine looking at a pile of sand from above. If the sand is uneven, a flat photo makes it look like a smooth, perfect hill. You can't tell if there are deep holes or missing chunks just by looking at the colors.
The Result: When the computer did flag a spot as "too low," it was almost always right (high precision), but it missed about half of the spots that really were too low (low recall). It kept missing the dangerous spots because it couldn't see the depth.

2. The Solution: Giving the Robot "3D Glasses"

To fix this, they added a RealSense camera, which is like giving the robot 3D glasses. It doesn't just see the color of the rocks; it sees how far away they are.

The Catch: These 3D cameras are a bit like cheap 3D movies; sometimes the image gets warped or tilted, making a flat surface look like a slanted hill. If you don't fix this, the robot thinks the rocks are missing when they are actually fine.

3. The "Magic" Fix: Straightening the Warped View

The team invented a clever math trick to "un-warp" the 3D image.

The Analogy: Imagine looking at a reflection in a funhouse mirror. The mirror distorts your face. To fix it, the researchers used the sleepers (the wooden or concrete beams the rails sit on) as a ruler. They know sleepers are supposed to be flat and straight.
The Process: The computer looks at the sleepers, measures how the mirror is warping them, fits a smooth surface to that warp, and subtracts it from the image. It also smooths the fit from one frame to the next so the picture doesn't flicker. Now, the robot sees the true height of the rocks.

4. The "Rotated" Glasses: Following the Train Tracks

Railway tracks aren't always straight lines in a photo; they curve and angle.

The Problem: Standard computer vision draws boxes around objects like a grid (upright squares). If a rock pile is on a curve, a square box cuts off the corners or includes too much empty space.
The Fix: They used a recent AI tool called SAM2 (Segment Anything Model 2). Think of this as a smart highlighter that traces the exact outline of the rocks. The system then fits a rotated box around that outline, so the box hugs the rocks no matter how the track is angled. This ensures the robot measures only the rocks, not the empty space next to them.

5. The Final Check: Two Ways to Spot Danger

Once the robot has a clean, 3D, perfectly aligned view of the rocks, it uses two rules to decide if the ballast is "insufficient" (dangerous):

The "Sinking Pool" Rule: Is the whole area of rocks lower than it should be? (Like a pool of water that's too shallow).
The "Edge Gap" Rule: Are there specific holes right next to the sleepers? (Like a pillow that has been pulled away from the headboard, leaving a gap).

The Result: A Safer Railway

When they tested this new system against the old "flat photo" method:

Old Method: Missed about half of the dangerous spots (recall of 0.49). It was like a security guard who only catches the bad guys when they are wearing a bright red hat, but ignores everyone else.
New Method: Caught roughly 75% to 80% of the dangerous spots, depending on the configuration (high recall), while its alarms were still right most of the time. It's like a security guard who notices anyone acting suspiciously, even if they are hiding in the shadows.

In a nutshell: This paper teaches a computer how to stop guessing and start measuring. By fixing the camera's distortion and wrapping the rocks in custom-fit boxes, the system can now reliably spot missing rocks before they cause an accident, making train travel much safer.

1. Problem Statement

The paper addresses the critical safety issue of railway ballast insufficiency. Ballast (crushed stone) supports the railroad ties (sleepers), distributes train loads, and provides drainage. Traditional inspection relies on manual visual checks, which are labor-intensive, subjective, and unsafe.

While automated computer vision methods exist, they face two primary limitations when using standard RGB (color) cameras:

Geometric Ambiguity: RGB data lacks depth information, making it difficult to distinguish between visually sufficient ballast and actual physical insufficiency (e.g., low fill levels).
Sensor Distortion: Depth sensors (specifically Intel RealSense) suffer from spatial biases, tilt, and warping due to environmental conditions and sensor orientation. Uncorrected depth data leads to unreliable geometric analysis.

The specific challenge is to develop a system that can automatically detect insufficient ballast with high recall (minimizing missed dangerous cases) while maintaining high precision, overcoming the tendency of RGB-only models to over-predict "sufficient" conditions.

2. Methodology

The authors propose a novel Depth-Enhanced YOLO–SAM2 framework that integrates deep learning detection, precise segmentation, and robust geometric depth correction. The pipeline consists of four main stages:

A. YOLO-Based Ballast Detection

Input: RGB frames from a top-down RealSense D435 camera.
Process: YOLOv8 is used to localize ballast regions. To reduce false positives, detection is constrained to the central 70% of the image (between the rails).
Output: Initial axis-aligned bounding boxes (AABB) serving as Regions of Interest (ROIs).
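The central-band constraint in this stage can be sketched as a simple post-filter on the detector's boxes. This is a minimal illustration, not the authors' code: it assumes the rails run vertically in the image, that the 70% figure refers to image width, and that a box is kept when its centre falls inside the band. The function names and the sample boxes are hypothetical.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; hypothetical box layout
Box = Tuple[float, float, float, float]


def in_central_band(box: Box, image_width: int, band_fraction: float = 0.70) -> bool:
    """Return True if the box centre lies inside the central vertical band.

    Assumption: the band is measured across the image width and centred on it.
    """
    margin = (1.0 - band_fraction) / 2.0 * image_width
    x_center = (box[0] + box[2]) / 2.0
    return margin <= x_center <= image_width - margin


def filter_detections(boxes: List[Box], image_width: int) -> List[Box]:
    """Keep only the detections whose centre is inside the central band."""
    return [b for b in boxes if in_central_band(b, image_width)]


# Three made-up detections on a 640-pixel-wide frame:
# one near the left edge, one in the middle, one near the right edge.
boxes = [(10, 50, 90, 120), (200, 40, 440, 130), (560, 60, 630, 140)]
kept = filter_detections(boxes, image_width=640)
print(kept)  # only the middle box survives
```

Filtering on the box centre rather than on full containment is a design choice made here for simplicity; a stricter variant could require the whole box to lie inside the band.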

B. SAM2 Segmentation and Rotated Bounding Box (RBB) Extraction

Refinement: The initial ROIs are processed by Segment Anything Model 2 (SAM2). Instead of segmenting the whole image, SAM2 is prompted with the YOLO ROI to generate precise binary masks for individual ballast segments.
Alignment: Since ballast regions follow the track geometry (often at an angle in the image), axis-aligned boxes either clip the region or include surrounding background. The system therefore computes a rotated minimum-area bounding box for each mask, so that the depth-sampling area follows the physical orientation of the track and sleepers.
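To make the rotated minimum-area box concrete, the sketch below computes one from a set of mask points using only the standard library. It relies on the known property that the smallest enclosing rectangle has a side collinear with an edge of the convex hull. In practice a library routine such as OpenCV's `cv2.minAreaRect` would normally be used; the function names and the synthetic points here are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in order, without repeats."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (area, side_a, side_b, angle_deg) of the smallest enclosing rectangle.

    Each hull edge is tried as the rectangle's orientation; the hull is
    projected onto the edge direction and its normal to get the two extents.
    """
    hull = convex_hull(points)
    if not hull:
        raise ValueError("need at least one point")
    best = None
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        along = [x * c + y * s for x, y in hull]
        across = [-x * s + y * c for x, y in hull]
        side_a = max(along) - min(along)
        side_b = max(across) - min(across)
        if best is None or side_a * side_b < best[0]:
            best = (side_a * side_b, side_a, side_b, math.degrees(theta))
    return best


# Synthetic "mask": a 4 x 2 rectangle rotated by 30 degrees, plus its centre.
rot = math.radians(30)
c, s = math.cos(rot), math.sin(rot)
local = [(0, 0), (4, 0), (4, 2), (0, 2), (2, 1)]
mask_points = [(u * c - v * s, u * s + v * c) for u, v in local]

area, side_a, side_b, angle = min_area_rect(mask_points)
xs = [p[0] for p in mask_points]
ys = [p[1] for p in mask_points]
aabb_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
print(round(area, 3), round(aabb_area, 3))  # rotated box area vs axis-aligned box area
```

For this synthetic shape the rotated box has area 8 (up to rounding), while the axis-aligned box is roughly twice as large, which is the extra background the rotated box avoids sampling.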

C. Robust Depth Correction

To address the sensor distortion described in the problem statement (the second limitation above), the authors introduce a novel correction pipeline:

Sleeper Sampling: Depth samples are extracted exclusively from the sleeper surfaces (the concrete ties between ballast segments), as these are known to be planar in reality.
Polynomial Bias Modeling: A low-order 2D polynomial surface ($\Delta z$) is fitted to model the spatial distortion (tilt and curvature) of the raw depth map.
RANSAC Estimation: Random Sample Consensus (RANSAC) is used to robustly fit the polynomial model, filtering out outliers caused by noise or occlusion.
Temporal Smoothing: An Exponential Moving Average (EMA) filter is applied to the bias parameters across frames to ensure smooth transitions and reduce flicker.
Correction: The estimated bias surface is subtracted from the raw depth map to produce a corrected depth map ($D_{corr}$).
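The five steps above can be sketched end to end in a few functions. This is a simplified illustration under stated assumptions, not the authors' pipeline: the bias surface is restricted to a first-order polynomial (a tilted plane, z = a + b*x + c*y), RANSAC keeps the best three-point hypothesis without a final least-squares refit, and the whole fitted surface is subtracted so that corrected values become offsets from the sleeper surface. The tolerance, iteration count, EMA weight, and the synthetic data are all hypothetical.

```python
import random
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (x, y, depth) measured on a sleeper surface
Plane = Tuple[float, float, float]   # (a, b, c) in z = a + b*x + c*y


def plane_through(p: Sample, q: Sample, r: Sample) -> Optional[Plane]:
    """Exact plane through three samples, or None if they are collinear in (x, y)."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p, q, r
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    if abs(det) < 1e-9:
        return None
    b = ((z2 - z1) * (y3 - y1) - (z3 - z1) * (y2 - y1)) / det
    c = ((x2 - x1) * (z3 - z1) - (x3 - x1) * (z2 - z1)) / det
    return (z1 - b * x1 - c * y1, b, c)


def ransac_plane(samples: List[Sample], iterations: int = 200,
                 tol: float = 2.0, seed: int = 0) -> Plane:
    """Keep the three-point hypothesis that explains the most samples within tol."""
    rng = random.Random(seed)
    best, best_inliers = None, -1
    for _ in range(iterations):
        model = plane_through(*rng.sample(samples, 3))
        if model is None:
            continue
        a, b, c = model
        inliers = sum(abs(z - (a + b * x + c * y)) < tol for x, y, z in samples)
        if inliers > best_inliers:
            best, best_inliers = model, inliers
    if best is None:
        raise ValueError("no non-degenerate sample triple found")
    return best


def ema(prev: Optional[Plane], new: Plane, alpha: float = 0.3) -> Plane:
    """Exponential moving average of the bias parameters across frames."""
    if prev is None:
        return new
    return tuple(alpha * n + (1 - alpha) * p for n, p in zip(new, prev))


def correct_depth(depth: List[List[float]], plane: Plane) -> List[List[float]]:
    """Subtract the fitted bias surface from every pixel (row index = y, column = x)."""
    a, b, c = plane
    return [[d - (a + b * x + c * y) for x, d in enumerate(row)]
            for y, row in enumerate(depth)]


# Synthetic sleeper samples on a tilted surface, plus three gross outliers.
samples = [(x, y, 500.0 + 0.5 * x + 0.2 * y) for x in range(0, 20, 2) for y in (0, 10)]
samples += [(5, 5, 560.0), (7, 3, 450.0), (11, 8, 620.0)]

estimate = ransac_plane(samples)
smoothed = ema(None, estimate)  # first frame: nothing to smooth against yet
raw = [[500.0 + 0.5 * x + 0.2 * y for x in range(4)] for y in range(3)]
corrected = correct_depth(raw, smoothed)
print(estimate)
```

On this noise-free example the outliers are ignored and the tilt is recovered, so the corrected map is flat. A higher-order surface, as the text describes, would need more than three points per hypothesis and a small least-squares solve in place of `plane_through`.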

D. Plane Reconstruction and Dual-Criteria Classification

Using the corrected depth and rotated boxes, the system classifies ballast sufficiency via two geometric criteria:

Reference Plane Construction: A local reference plane is reconstructed for each ballast region by linearly interpolating between the depth profiles of the top and bottom sleeper edges.
Dual-Metric Classification:
- Criterion 1 (Global Residual): Calculates the proportion of pixels where the ballast depth falls significantly below the reference plane (detecting widespread depression).
- Criterion 2 (Edge Gap): Analyzes the "edge bands" near the sleeper interfaces for localized gaps (detecting localized ballast loss).
- Decision Logic: A region is classified as "insufficient" if either criterion is triggered, or if the YOLO confidence suggests insufficiency (Logical OR rule).
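The reference plane and the OR-combined decision can be illustrated with a small sketch. It assumes a top-down camera (so a larger depth value means a lower surface), a region stored as rows running from the top sleeper edge to the bottom one, and edge bands made of the first and last few rows. All thresholds (`drop`, `global_ratio`, `edge_ratio`, `band_rows`) are hypothetical placeholders, not values from the paper.

```python
from typing import List

Depth = List[List[float]]  # rows run from the top sleeper edge to the bottom sleeper edge


def reference_plane(top: List[float], bottom: List[float], n_rows: int) -> Depth:
    """Linearly interpolate, column by column, between the two sleeper-edge profiles."""
    plane = []
    for r in range(n_rows):
        t = r / (n_rows - 1) if n_rows > 1 else 0.0
        plane.append([(1 - t) * a + t * b for a, b in zip(top, bottom)])
    return plane


def fraction_below(region: Depth, plane: Depth, drop: float) -> float:
    """Share of pixels lying more than `drop` below the reference plane."""
    total = sum(len(row) for row in region)
    low = sum(d - p > drop
              for row, prow in zip(region, plane)
              for d, p in zip(row, prow))
    return low / total if total else 0.0


def classify(region: Depth, top: List[float], bottom: List[float],
             yolo_insufficient: bool = False, drop: float = 30.0,
             global_ratio: float = 0.4, edge_ratio: float = 0.4,
             band_rows: int = 2) -> str:
    """Logical OR of the global-residual, edge-gap, and detector criteria.

    band_rows is assumed to be at least 1.
    """
    plane = reference_plane(top, bottom, len(region))
    global_hit = fraction_below(region, plane, drop) > global_ratio
    # Edge bands: the rows next to each sleeper interface.
    band = region[:band_rows] + region[-band_rows:]
    band_plane = plane[:band_rows] + plane[-band_rows:]
    edge_hit = fraction_below(band, band_plane, drop) > edge_ratio
    if global_hit or edge_hit or yolo_insufficient:
        return "insufficient"
    return "sufficient"


# Synthetic 6 x 4 regions with flat sleeper edges at a depth of 500.
top = [500.0] * 4
bottom = [500.0] * 4
full = [[505.0] * 4 for _ in range(6)]                                  # ballast near sleeper level
gap = [[540.0] * 4 for _ in range(2)] + [[505.0] * 4 for _ in range(4)]  # gap next to the top sleeper
sunk = [[540.0] * 4 for _ in range(6)]                                  # whole region depressed

print(classify(full, top, bottom), classify(gap, top, bottom), classify(sunk, top, bottom))
```

In the `gap` example only a third of the region is low, so the global criterion alone would not fire with these placeholder thresholds; the edge-band check is what flags it, which is the case the second criterion is meant to cover.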

3. Key Contributions

Integrated RGB-D Pipeline: A novel framework combining YOLOv8 for detection, SAM2 for precise mask refinement, and Rotated Bounding Boxes (RBB) tailored to railway geometry.
Robust Depth Correction: A calibration-free method using RANSAC-based polynomial fitting and temporal smoothing to correct RealSense spatial distortions using sleeper surfaces as ground truth.
Dual-Criteria Classifier: A geometric classification strategy that jointly evaluates global depth residuals and localized edge gaps, significantly improving reliability over single-metric approaches.

4. Experimental Results

The system was evaluated on a dataset of 1,405 training images and 418 test images collected from real railroad tracks.

Baseline Performance (YOLO-Only): Achieved high precision (0.99) but very low recall (0.49) for insufficient ballast. The model tended to over-predict "sufficient," missing many dangerous cases (F1-score: 0.66).
Proposed Method Performance:
- By integrating depth correction and rotated bounding boxes, the system raised recall to as much as 0.80 (from 0.49) and the F1-score to 0.80 or higher (from 0.66).
- The best configuration (CD-YOLO-SAM2-RBB with all three criteria: Global, Edge, and YOLO) achieved a Precision of 0.86, Recall of 0.75, and F1-score of 0.80.
Key Finding: The depth-enhanced approach significantly reduced false negatives (missed detections), which is critical for safety, while keeping precision reasonably high (0.86 in the best configuration, down from 0.99 for the baseline).
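As a quick arithmetic check, the reported F1-scores follow from the reported precision and recall values through the standard harmonic-mean definition, F1 = 2PR / (P + R). The helper name below is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline = f1_score(0.99, 0.49)  # YOLO-only figures reported above
proposed = f1_score(0.86, 0.75)  # best reported configuration
print(f"{baseline:.2f} {proposed:.2f}")  # prints "0.66 0.80"
```

Because F1 is a harmonic mean, the baseline's very high precision cannot compensate for its low recall, which is why the score rises even though precision falls from 0.99 to 0.86.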

5. Significance

This work demonstrates that integrating geometric depth analysis with state-of-the-art segmentation can substantially improve the reliability of infrastructure inspection, particularly for defects that RGB imagery alone tends to miss.

Safety Impact: By shifting from RGB-only to depth-enhanced analysis, the system mitigates the risk of overlooking insufficient ballast, a major cause of track instability.
Robustness: The depth correction method allows the system to function reliably despite sensor tilt and environmental noise without external calibration targets.
Future Application: The framework provides a scalable foundation for automated, train-mounted inspection systems, potentially extending to other curved or complex track geometries through multi-camera fusion.