Dual-Interaction-Aware Cooperative Control Strategy for Alleviating Mixed Traffic Congestion

Imagine a busy highway merging point as a crowded dance floor where everyone is trying to find a partner to move smoothly. In this scenario, you have two types of dancers:

Human-Driven Vehicles (HDVs): These are the dancers who follow their own rhythm. Some are aggressive (cutting in line), some are cautious (waiting too long), and they can't really talk to each other to coordinate. They often bump into each other or freeze up, causing a traffic jam.
Connected and Automated Vehicles (CAVs): These are the "smart" dancers. They can talk to each other and plan their moves in advance to keep the dance flowing.

The problem is that the smart dancers are currently surrounded by a chaotic crowd of human dancers. The smart cars don't know how to best cooperate because the humans are unpredictable.

This paper introduces a new "Super Coach" system called DIACC (Dual-Interaction-Aware Cooperative Control) to teach the smart cars how to dance perfectly, even in a chaotic crowd.

Here is how the Super Coach works, broken down into three simple tricks:

1. The "Two Different Ears" Trick (D-IADM)

Imagine a smart car wearing two different pairs of headphones.

Headphone A (The "Buddy" Ear): This listens only to other smart cars. Since they are all on the same team, they can share their future plans. "I'm going to slow down so you can merge!" This is cooperative.
Headphone B (The "Stranger" Ear): This listens to the human drivers. Since humans don't share their plans, the smart car can only watch their past movements. "That red car just swerved left, so I better be careful." This is observational.

Why it matters: Older systems treated all cars the same. This new system recognizes that talking to a teammate is different from watching a stranger. By separating these two types of listening, the smart car makes much better decisions.

2. The "Big Picture" Coach (C-IEC)

In a normal game, a player only sees what's right in front of them. But in a traffic jam, a move that looks good for one car might cause a disaster three cars back.

The C-IEC is like a coach standing on a high tower with a drone view of the whole dance floor.

While the smart car (the player) is focused on its immediate neighbors, the Coach sees how the whole crowd is reacting.
The Coach tells the car: "Hey, slowing down now might feel safe for you, but it's going to cause a ripple effect that stops everyone behind you. Let's try a different move."
This helps the cars learn to cooperate not just for themselves, but for the entire traffic flow.

3. The "Focus on the Hard Stuff" Reward System

When learning to dance, it's easy to get good at the easy steps (like dancing in an empty room). But the real challenge is the crowded, chaotic part of the floor.

The paper's reward system is like a teacher who gradually stops rewarding the easy practice and judges the class mainly by how well it handles the hardest problems.

It uses a "temperature" dial. At first, the dial is set high and every car's experience counts about equally, so the system explores everything.
As training continues, the dial turns down, and the system starts focusing intensely on the "hot spots": the moments where cars are about to crash or get stuck.
This ensures the smart cars get really good at handling the most dangerous, crowded situations, rather than just being good at empty roads.

The Safety Net (PSAR)

Even with a great coach, sometimes a student might make a risky move. The paper includes a PSAR module, which is like a safety net or a referee.

If the smart car decides to make a move that looks too dangerous (like changing lanes too close to another car), the referee instantly steps in and says, "No, stop! Slow down instead."
This keeps the training safe and prevents accidents while the cars are learning.

The Result

When the researchers tested this system in a simulation:

Traffic flowed faster: The "dance floor" stayed moving, and fewer cars got stuck.
Fewer accidents: The "safety net" and better planning meant almost zero crashes, even in heavy traffic.
Better teamwork: The smart cars learned to work together so well that they could handle crowds of human drivers that usually cause gridlock.

In short: This paper teaches smart cars to listen differently to their teammates vs. strangers, gives them a coach with a global view, and steadily shifts their practice toward the hardest moves as training goes on. The result is a traffic system that is safer, faster, and less frustrating for everyone.

Here is a detailed technical summary of the paper "Dual-Interaction-Aware Cooperative Control Strategy for Alleviating Mixed Traffic Congestion."

1. Problem Statement

The paper addresses the challenge of cooperative control for Connected and Automated Vehicles (CAVs) in mixed traffic bottleneck scenarios (e.g., lane reductions or merging areas) where CAVs coexist with Human-Driven Vehicles (HDVs).

Core Challenge: HDVs exhibit diverse, unpredictable, and non-cooperative driving behaviors, while CAVs must coordinate to optimize global traffic flow.
Limitations of Existing Methods:
- Rule-based/Optimization: High computational costs and difficulty modeling complex, uncertain human behaviors.
- Single-Agent RL: Struggles with global coordination and requires massive interaction data.
- Standard Multi-Agent RL (MARL): While better at coordination, existing decentralized MARL frameworks often fail to distinguish between CAV-CAV cooperative interactions (where agents share intent) and CAV-HDV observational interactions (where agents only observe trajectories). Furthermore, standard critics often lack the global perspective needed to evaluate how local interactions impact overall traffic dynamics.

2. Methodology: DIACC Strategy

The authors propose Dual-Interaction-Aware Cooperative Control (DIACC), a framework built upon the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm. It integrates three key innovations to enhance both local decision-making and global value estimation.

A. Decentralized Interaction-Adaptive Decision-Making (D-IADM) Module

This module enhances the Actor network's ability to perceive local interactions by distinguishing between two types of neighbors:

Differentiation: It separates neighboring vehicles into HDV sets and CAV sets.
Dual Graph Attention:
- CAV-HDV Interaction: Modeled using a Graph Attention Network (GAT) based solely on historical trajectories and current states.
- CAV-CAV Interaction: Modeled using a separate GAT that incorporates previous decision feedback ($a_{t-1}$), acknowledging that CAVs share intent and can coordinate.
Action Refinement (PSAR): A lightweight, rule-based Proactive Safety-based Action Refinement module sits between the Actor and execution. It monitors Time-To-Collision (TTC) and longitudinal gaps. If a proposed action violates safety thresholds, PSAR applies heuristic corrections (e.g., canceling a lane change, forcing a brake) to prevent collisions, while feeding the refined action back into the next observation step for learning.
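
The PSAR logic described above can be illustrated with a minimal rule-based sketch. This is not the authors' implementation: the threshold values, the `Action` fields, and the function names are hypothetical, chosen only to show how a TTC check and a gap check could override a proposed action before execution.

```python
from dataclasses import dataclass
import math

# Hypothetical thresholds; the summary does not give numeric values.
TTC_MIN_S = 3.0      # minimum acceptable time-to-collision, seconds
GAP_MIN_M = 10.0     # minimum acceptable longitudinal gap, metres
BRAKE_ACCEL = -3.0   # corrective deceleration, m/s^2


@dataclass
class Action:
    lane_change: int   # -1 = left, 0 = keep lane, +1 = right
    accel: float       # longitudinal acceleration, m/s^2


def time_to_collision(gap_m, ego_speed, lead_speed):
    """TTC to the vehicle ahead; infinite when the ego is not closing in."""
    closing_speed = ego_speed - lead_speed
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def refine_action(proposed, ego_speed, lead_gap_m, lead_speed, target_lane_gap_m):
    """Return a possibly corrected copy of the actor's proposed action."""
    lane_change, accel = proposed.lane_change, proposed.accel

    # Rule 1: cancel a lane change into a gap that is too small.
    if lane_change != 0 and target_lane_gap_m < GAP_MIN_M:
        lane_change = 0

    # Rule 2: force braking when the leader is too close in time or space.
    ttc = time_to_collision(lead_gap_m, ego_speed, lead_speed)
    if ttc < TTC_MIN_S or lead_gap_m < GAP_MIN_M:
        accel = min(accel, BRAKE_ACCEL)

    return Action(lane_change, accel)
```

In a training loop, the refined action (rather than the raw actor output) would be executed and, as the summary notes, fed back as $a_{t-1}$ in the next observation.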

B. Centralized Interaction-Enhanced Critic (C-IEC)

This module improves the Critic network's ability to evaluate global traffic states during the centralized training phase.

Integrated Traffic Dynamics Representation (ITDR): The critic constructs a global vehicle interaction graph connecting all vehicles.
Cross-Attention Mechanism: It uses a multi-head cross-attention mechanism where traffic features (global lane stats, road structure) act as the Query, and global interaction features (from the GAT) act as Key and Value.
Goal: This allows the critic to explicitly learn how specific vehicle interactions influence global traffic evolution. The resulting value estimates ($V(s)$) are more accurate and guide the Actor toward policies that benefit the system as a whole, not just the individual agent.
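
To make the Query/Key/Value roles concrete, here is a minimal single-head sketch of scaled dot-product cross-attention in plain Python. It is a simplification under stated assumptions: the paper's critic is multi-head and uses learned projections, both omitted here, and the feature vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    query  : traffic feature vector (acts as the Query)
    keys   : per-vehicle interaction features (act as the Keys)
    values : per-vehicle interaction features (act as the Values)
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim_v = len(values[0])
    context = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(dim_v)
    ]
    return context, weights


# Hypothetical toy inputs: one traffic query, three vehicle interaction vectors.
traffic_query = [1.0, 0.0]
interaction_feats = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
context, weights = cross_attention(traffic_query, interaction_feats, interaction_feats)
```

The attention weights indicate which vehicle interactions the traffic-level query attends to most; in this toy input the first vehicle, whose features align with the query, receives the largest weight.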

C. Cooperative Reward Mechanism with Softmin Aggregation

To prevent the training process from being dominated by agents in simple, low-interaction scenarios, the authors designed a specialized reward function:

Softmin Aggregation: The local ego rewards are aggregated using a softmin function rather than a simple average.
Temperature Annealing: A temperature parameter ($\tau$) is annealed from high to low during training.
- High $\tau$ (Early training): Uniform weighting encourages broad exploration.
- Low $\tau$ (Late training): Higher weights are assigned to agents with lower rewards (i.e., those in difficult, interaction-intensive scenarios).
Effect: This forces the policy to focus learning resources on the most challenging cooperative cases (e.g., dense merging), improving robustness.
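
The effect of the temperature can be seen in a small sketch. The summary does not give the exact formula, so this assumes the common Boltzmann form of softmin, $w_i = \exp(-r_i/\tau) / \sum_j \exp(-r_j/\tau)$, with the aggregated reward $\sum_i w_i r_i$. The exponential annealing schedule and its start and end values are likewise hypothetical.

```python
import math


def softmin_weights(rewards, tau):
    """Softmin weights: lower rewards receive higher weight as tau shrinks."""
    r_min = min(rewards)  # shift by the minimum for numerical stability
    exps = [math.exp(-(r - r_min) / tau) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def softmin_aggregate(rewards, tau):
    """Weighted sum of local rewards using softmin weights."""
    weights = softmin_weights(rewards, tau)
    return sum(w * r for w, r in zip(weights, rewards))


def annealed_tau(step, total_steps, tau_start=10.0, tau_end=0.1):
    """Exponential decay from tau_start to tau_end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac


# Hypothetical local rewards for three CAVs; the third is in a hard scenario.
rewards = [1.0, 0.5, -2.0]
early = softmin_aggregate(rewards, tau=1e6)   # close to the plain average
late = softmin_aggregate(rewards, tau=0.01)   # close to the minimum reward
```

With a very high temperature the aggregate approaches the simple mean of the local rewards, and with a very low temperature it approaches the worst agent's reward, which is the shift in emphasis the annealing is meant to produce.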

3. Key Contributions

D-IADM Module: A novel actor architecture that explicitly models the heterogeneity of mixed traffic by separating CAV-CAV cooperative logic from CAV-HDV observational logic using dual graph attention networks.
C-IEC Module: A centralized critic design that applies cross-attention to a global interaction graph to capture how local interactions shape global traffic dynamics, improving value estimation accuracy.
Curriculum-Inspired Reward Design: A softmin-based reward mechanism with temperature annealing that dynamically shifts focus from exploration to optimizing difficult, interaction-intensive scenarios.
Safety Integration: The inclusion of the PSAR module ensures safety constraints are met during exploration, accelerating training convergence and reducing safety-critical events.

4. Experimental Results

The strategy was evaluated in a SUMO-based simulation of a 1.3 km bottleneck with varying capacity reductions (25% and 50%) and CAV penetration rates (20%–40%).

Training Performance:
- DIACC achieved a lower collision rate and a higher global reward than vanilla MAPPO, MAPPO-IADM (without C-IEC), and DIACC without PSAR.
- The annealing reward strategy proved superior to fixed-temperature settings, balancing exploration and exploitation effectively.
Testing Performance (Zero-Shot Generalization):
- Safety: DIACC reduced Safety-Critical Events (SCEs) to 0% across all tested zero-shot scenarios (different vehicle counts, driving styles, and road configurations), whereas baseline models still exhibited collisions.
- Efficiency: In high-density scenarios (N=30, N=40) and severe bottlenecks (50% reduction), DIACC significantly outperformed baselines in average speed and reduced Waiting Events (WEs).
- Ablation Insights:
  - Removing C-IEC (MAPPO-IADM) maintained good efficiency in simple scenarios but failed to eliminate collisions in high-density scenarios, indicating that global interaction awareness is needed there.
  - Removing PSAR led to unstable training and higher collision rates during early learning phases.

5. Significance

Bridging the Gap: The paper successfully bridges the gap between local agent autonomy and global traffic optimization in mixed traffic, a critical step toward real-world CAV deployment.
Handling Uncertainty: By explicitly distinguishing between cooperative (CAV) and observational (HDV) interactions, the model adapts better to the unpredictability of human drivers.
Scalability and Robustness: The zero-shot testing shows that the learned policy generalizes to unseen traffic densities and road configurations in simulation, which is an encouraging sign for eventual real-world deployment.
Safety-First Learning: The integration of rule-based safety refinement (PSAR) within an RL framework offers a practical pathway to ensure safety during the training of autonomous agents, addressing a major hurdle in RL-based traffic control.

In conclusion, the DIACC strategy advances MARL for traffic control: its results indicate that dual-level interaction awareness (local differentiation + global dynamics) is important for alleviating congestion and ensuring safety in mixed traffic bottlenecks.