🚇 The Problem: The "Blind" Train
Imagine a metro train zooming through a tunnel at 60 miles per hour. It's dark, the lights are flickering, and maybe it's raining outside. The train needs to know exactly where it is to stop safely at the next station.
Usually, trains use cameras (like the ones in your phone) to read "Kilometer Markers" (signs on the wall that say "Station 5," "Station 6," etc.). But in this chaotic environment, regular cameras get confused.
- Too dark? The camera sees a black void.
- Too bright? The camera gets blinded by the sun.
- Moving too fast? The image turns into a blurry smear.
It's like trying to read a street sign while driving through a heavy fog at night with your headlights on full beam. You just can't see the details.
🧠 The Solution: Giving the Train "Super-Senses"
The researchers realized that regular cameras aren't enough. So, they gave the train a second pair of eyes: an Event Camera.
Think of a regular camera as one that takes snapshots on a fixed schedule (say, 30 pictures every second). It captures everything, even the boring, static parts of the scene.
Think of an Event Camera like a hyper-alert security guard. It doesn't take pictures of the whole room. Instead, it only shouts out when something changes.
- If a light flickers? Shout!
- If a sign moves past? Shout!
- If the train speeds up? Shout!
This "Event Camera" is amazing in the dark and at high speeds because it ignores the static darkness and only focuses on movement and change. It's like having night-vision goggles that only highlight moving objects.
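For a concrete feel, here is a toy sketch (my own illustration, not the paper's code) of the event-camera idea: compare two frames and report only the pixels whose brightness changed. The 0.15 threshold and the log-brightness trick are illustrative assumptions.

```python
import numpy as np

def events_from_frames(prev_frame, next_frame, threshold=0.15):
    """Return (row, col, polarity) for pixels whose log-brightness
    changed by more than `threshold`; everything else stays silent."""
    diff = np.log1p(next_frame.astype(float)) - np.log1p(prev_frame.astype(float))
    rows, cols = np.nonzero(np.abs(diff) > threshold)
    polarity = np.sign(diff[rows, cols]).astype(int)  # +1 brighter, -1 darker
    return list(zip(rows.tolist(), cols.tolist(), polarity.tolist()))

# A mostly static dark scene with one pixel lighting up as a sign passes:
prev = np.zeros((4, 4))
nxt = prev.copy()
nxt[1, 2] = 255.0
print(events_from_frames(prev, nxt))  # → [(1, 2, 1)]: only the changed pixel "shouts"
```

A regular camera would hand you all 16 pixels of both frames; the event stream is just the one change, which is why it stays useful in darkness and at speed.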
🤝 The Team-Up: The "Hyper-Graph" Dance
The paper's big idea is to make the regular camera (the "Visual") and the event camera (the "Alert") work together perfectly. They call this RGB-Event Fusion.
But just gluing the two images together isn't enough. You need a smart way to mix their information. The researchers invented a method called HGP-KMR (HyperGraph Prompt).
Here is the analogy:
Imagine you are trying to solve a puzzle.
- The Regular Camera is a friend who sees the colors and shapes clearly but gets confused by the blur.
- The Event Camera is a friend who sees the motion and edges perfectly but doesn't see the colors.
- The HyperGraph is a super-smart project manager sitting between them.
Instead of just saying, "Here is my picture," the project manager builds a map of connections (a HyperGraph, a map where a single link can tie together many clues at once) between the details the first friend sees and the movements the second friend sees. It asks: "Hey, that blurry shape the first friend sees? It matches that sharp edge the second friend saw moving! Let's combine them!"
This "manager" then whispers these combined clues back to the main brain (the AI model) to help it read the sign. This is the "Prompt" part—it's like giving the AI a helpful hint before it tries to read the text.
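To make the "project manager" idea concrete, here is a toy hypergraph-style fusion sketch in NumPy. It is an illustrative assumption, not the paper's actual HGP-KMR module: each feature anchors a hyperedge of its near-matches (measured by cosine similarity), and every feature is then refreshed with its group's average, so a blurry RGB clue absorbs the sharp event clue it matches.

```python
import numpy as np

def hypergraph_fuse(rgb_feats, event_feats, sim_threshold=0.8):
    """Toy fusion (not the paper's HGP-KMR): pool RGB and event features,
    group similar ones into hyperedges, and average within each group."""
    feats = np.vstack([rgb_feats, event_feats])            # all clues in one pool
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    # Incidence matrix H: H[i, j] = 1 if feature j joins feature i's hyperedge.
    H = (normed @ normed.T > sim_threshold).astype(float)
    degree = H.sum(axis=1, keepdims=True)                  # group sizes
    return (H @ feats) / degree                            # each clue absorbs its group's view

rgb = np.array([[1.0, 0.1], [0.0, 1.0]])   # blurry color clues from the regular camera
evt = np.array([[0.9, 0.2], [0.1, 0.9]])   # sharp motion clues from the event camera
fused = hypergraph_fuse(rgb, evt)
```

Here the first RGB clue and the first event clue point the same way, so they land in one hyperedge and come out as their average: the "combined hint" that gets whispered back to the main model.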
📚 The New Textbook: EvMetro5K
To teach their AI how to do this, the researchers couldn't just use old photos. They needed a new textbook.
- They built a special rig with both cameras on a real train.
- They drove it through tunnels, in the rain, and in the sun for 20 hours.
- They created a new dataset called EvMetro5K, which contains 5,599 pairs of "Regular Photo" + "Event Alert."
It's like creating a new language textbook specifically for "Train Reading," filled with examples of blurry signs, dark tunnels, and rainy days.
🏆 The Results: Reading in the Dark
When they tested their new system:
- On the new dataset: It got 95.1% accuracy. That's huge! The old methods (using just regular cameras) were stuck around 84%.
- On other tests: Even on standard text recognition tests (like reading artistic handwriting), their method was the best.
Why is this a big deal?
- Safety: Trains can now know exactly where they are, even if the GPS fails or the tunnel is pitch black.
- Efficiency: The system is surprisingly small and fast. It doesn't need a supercomputer; it can run on standard hardware.
- Future-Proof: They made the code and the data public, so other scientists can build on this to make trains even smarter.
🎯 The Bottom Line
This paper is about teaching a train to read signs in the worst possible conditions by giving it two types of eyes and a smart brain that knows how to mix the information from both. It's like upgrading a driver from having just "eyes" to having "eyes plus a motion-sensing radar," all working together to ensure the train never misses its stop.