ExpressMind: A Multimodal Pretrained Large Language Model for Expressway Operation

This paper introduces ExpressMind, a multimodal pretrained large language model that addresses data scarcity and reasoning limitations in expressway operations. It combines a full-stack expressway dataset, a dual-layer pre-training paradigm, a Graph-Augmented RAG framework, and a Reinforcement Learning-aligned Chain-of-Thought mechanism, outperforming existing baselines in event detection, safety response, and traffic analysis.

Zihe Wang, Yihuan Wang, Haiyang Yu, Zhiyong Cui, Xiaojian Liao, Chengcheng Wang, Yonglin Tian, Yongxin Tong

Published 2026-03-18

Imagine a highway not just as a road with cars, but as a living, breathing organism that needs a brain to manage it. Right now, most highway management systems are like a team of specialized robots: one robot knows the traffic laws, another watches the cameras, and a third handles emergency calls. They don't talk to each other well. If a foggy day causes a pile-up, the "law robot" doesn't know the cameras are blurry, and the "camera robot" doesn't know the specific legal steps to take. They work in silos, which can lead to slow or confused reactions.

ExpressMind is the solution to this problem. Think of it as the super-intelligent "Air Traffic Controller" for highways, but instead of planes, it manages cars, trucks, and weather. It's a new kind of AI (a Multimodal Large Language Model) designed specifically to understand the chaotic, high-stakes world of expressways.

Here is how ExpressMind works, broken down into simple concepts:

1. The "Super-Student" Training (The Dataset)

Before ExpressMind could be a controller, it had to go to school. But instead of reading general books, it was fed a custom-made library that no one else had ever seen.

  • The Textbooks: It read millions of pages of traffic laws, engineering manuals, and emergency guides.
  • The Field Trips: It watched thousands of hours of real highway videos, learning what a "traffic jam" looks like versus a "construction zone."
  • The Drills: It practiced with real emergency reports, learning not just what happened, but why it happened and how to fix it.

2. The "Two-Step" Learning Process

The researchers didn't just dump all this data on the AI at once. They taught it in two stages:

  • Stage 1 (Absorbing Knowledge): Imagine a student reading a library of books to understand the basic rules of the road. ExpressMind learned the vocabulary and the "grammar" of traffic.
  • Stage 2 (Learning to Think): This is where it gets smart. The AI was taught to think like a human expert. Instead of just guessing, it was trained to follow a logical chain: See the accident → Analyze the cause → Decide the best action → Check if the action is safe.
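That "see → analyze → decide → check" chain can be sketched as a tiny pipeline. This is a toy illustration only: the incident descriptions, lookup tables, and function names are all hypothetical stand-ins for what the model learns, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Incident:
    description: str
    cause: Optional[str] = None
    action: Optional[str] = None
    safe: bool = False

# Illustrative stand-ins for the model's learned associations.
CAUSES = {"multi-car pile-up in fog": "low visibility"}
ACTIONS = {"low visibility": "close affected lane and lower speed limit"}

def see(description: str) -> Incident:
    """Step 1: register the observed event."""
    return Incident(description=description)

def analyze(incident: Incident) -> Incident:
    """Step 2: infer the likely cause."""
    incident.cause = CAUSES.get(incident.description, "unknown")
    return incident

def decide(incident: Incident) -> Incident:
    """Step 3: pick a response for the inferred cause."""
    incident.action = ACTIONS.get(incident.cause, "dispatch patrol to assess")
    return incident

def check(incident: Incident) -> Incident:
    """Step 4: verify the chosen action exists before approving it."""
    incident.safe = incident.action is not None
    return incident

def respond(description: str) -> Incident:
    """Run the full reasoning chain."""
    return check(decide(analyze(see(description))))
```

Each step writes its conclusion into the `Incident` record, so the final answer carries its own reasoning trail rather than being a bare guess.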

3. The "Coach" (Reinforcement Learning)

How do you make sure the AI doesn't give dangerous advice? The researchers used a digital coach.

  • Every time the AI suggested a plan (like "close the left lane"), the coach checked: Is this safe? Is it logical? Did it follow the rules?
  • If the AI got it right, it got a "gold star" (a reward). If it made a mistake, it got a "red flag."
  • Over time, the AI learned to think like a seasoned highway safety expert, steering its decisions toward choices that are safe and practical.

4. The "Instant Library" (Graph-Augmented RAG)

AI models can sometimes forget new things or make things up (hallucinate). ExpressMind has a magic reference book that updates in real-time.

  • If a new traffic regulation is passed today, or if a specific bridge is closed right now, ExpressMind doesn't have to wait to be retrained.
  • It instantly "looks up" the latest facts in its digital graph library and uses them to answer questions. It's like having a GPS that knows about road closures the second they happen.
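The "look it up instead of memorizing it" pattern can be shown with a toy fact store. In the real system this would be a live knowledge graph; the entities, fields, and helper functions here are invented for illustration.

```python
# Toy fact graph: entity -> {relation: value}. A real deployment would
# query a continuously updated knowledge graph instead of this dict.
graph = {
    "Bridge 7": {"status": "closed", "since": "today 06:00"},
    "Regulation 12": {"text": "trucks must use lane 3 in fog"},
}

def retrieve(entity: str) -> dict:
    """Fetch the latest facts for an entity at answer time,
    rather than relying on whatever was true at training time."""
    return graph.get(entity, {})

def answer(question: str, entity: str) -> str:
    """Compose a reply grounded in the retrieved facts."""
    facts = retrieve(entity)
    if not facts:
        return "No current record found."
    context = "; ".join(f"{k}: {v}" for k, v in facts.items())
    return f"Q: {question} -> {context}"
```

Because the answer is assembled from the graph at query time, updating one entry (say, reopening Bridge 7) changes the next answer immediately, with no retraining.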

5. The "Super-Eyes" (Multimodal Vision)

Most AI can read text, but ExpressMind can watch and understand video.

  • It doesn't just see "pixels"; it understands the story of the video. It can look at a camera feed, see a car swerving, and immediately understand, "That's a tire blowout, not a driver distraction."
  • It uses a special technique called Visual-Prior Alignment. Imagine a detective looking at a crime scene. The AI is trained to pay extra attention to the visual clues (the skid marks, the smoke) before reading the report, ensuring it doesn't miss the most important visual details.
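One simple way to realize that "pay extra attention to visual clues first" intuition is to bias attention scores toward visual tokens before normalizing them. The sketch below is an assumption about the general idea, not the paper's actual Visual-Prior Alignment method; the bias value and token split are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def prior_weighted_attention(scores, is_visual, visual_bias=1.5):
    """Add a fixed bias to visual-token scores before the softmax,
    so visual evidence (skid marks, smoke) gets more attention weight
    than text tokens with equal raw scores. Bias value is illustrative."""
    biased = [s + (visual_bias if v else 0.0)
              for s, v in zip(scores, is_visual)]
    return softmax(biased)
```

With equal raw scores, the visual tokens end up with higher normalized weight, which is the "look at the scene before reading the report" behavior described above.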

Why Does This Matter?

In the real world, ExpressMind is already being tested on highways in China. It acts as a central brain that can:

  • Spot trouble instantly: "Hey, there's a pile-up on the highway in foggy weather!"
  • Write the plan: "Close lane 1, send a tow truck, and warn drivers 5 miles back."
  • Explain the why: "We are closing lane 1 because the debris is blocking the exit ramp."

In short: ExpressMind is the first AI that truly "gets" highways. It combines the memory of a lawyer, the eyes of a security guard, and the decision-making skills of a traffic commander into one helpful, super-smart assistant. It turns chaotic highway data into clear, safe, and fast actions.
