Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization

This paper proposes AP-PCO, a joint position-color optimization framework that generates cross-spectral adversarial patches to effectively attack visual-infrared dense prediction systems by simultaneously perturbing both modalities while maintaining stealth through color adaptation.

He Li, Wenyue He, Weihang Kong, Xingchen Zhang

Published 2026-03-03

Imagine you have a super-smart security guard who can see in two ways at once: with normal eyes (seeing colors and textures) and with night-vision goggles (seeing heat and shapes in the dark). This guard is used to make important decisions, like counting people in a crowd, finding lost items, or combining two pictures into one perfect image.

This paper is about how to trick this "super-guard" using a single, cleverly designed sticker.

The Problem: The "One-Size-Fits-None" Sticker

Previously, hackers knew how to make a sticker that would confuse a guard with just normal eyes. They'd put a bright, weirdly colored patch on a person's shirt, and the guard would think, "That's not a person; that's a giant dog!"

But this new "super-guard" sees both normal vision and infrared (heat) vision.

  • The Issue: If you make a sticker that looks like a chaotic rainbow to the normal eye, it might look like a weird, glowing blob to the night-vision eye.
  • The Result: The normal eye gets confused, but the night-vision eye says, "Wait, that doesn't look right either," and the guard ignores the trick. The attack fails because the two "eyes" don't agree on what the sticker is.

The Solution: The "Chameleon Sticker" (AP-PCO)

The authors of this paper invented a new way to make these stickers. They call it AP-PCO. Think of it as a Chameleon Sticker that changes its personality depending on who is looking at it, all while being the same physical object.

Here is how they did it, using simple analogies:

1. The "Evolutionary Search" (Finding the Perfect Spot)

Instead of guessing where to put the sticker, the computer acts like a biologist watching a colony of ants.

  • It releases thousands of "virtual ants," each carrying a slightly different sticker idea (different sizes, different spots on the image).
  • It tests them all against the security guard.
  • The ones that confuse the guard the most get to "reproduce" (their ideas are mixed and mutated).
  • Over time, the colony evolves to find the perfect spot where the sticker causes the maximum confusion. It's like natural selection, but for stickers.
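The "colony of ants" idea above can be sketched as a tiny evolutionary search. This is not the paper's actual algorithm, just a minimal illustration: each candidate is a patch placement `(x, y, scale)`, and `fitness` is a hypothetical stand-in for "how confused the model gets" (in the real attack it would be the victim model's error).

```python
import random

# A minimal evolutionary search over patch placements (illustrative sketch,
# not the paper's AP-PCO algorithm). `fitness` is a hypothetical surrogate
# for the attacked model's confusion; here it peaks near the image centre
# at a medium patch size.

IMG_W, IMG_H = 640, 480

def fitness(x, y, scale):
    # Stand-in for "how badly the guard misbehaves" (higher = better attack).
    return -((x - 320) ** 2 + (y - 240) ** 2) / 1e4 - (scale - 0.3) ** 2

def random_patch():
    return (random.uniform(0, IMG_W), random.uniform(0, IMG_H),
            random.uniform(0.05, 0.6))

def mutate(p, rate=0.2):
    x, y, s = p
    return (min(max(x + random.gauss(0, IMG_W * rate), 0), IMG_W),
            min(max(y + random.gauss(0, IMG_H * rate), 0), IMG_H),
            min(max(s + random.gauss(0, 0.05), 0.05), 0.6))

def crossover(a, b):
    # Mix two parents' ideas, coordinate by coordinate.
    return tuple(random.choice(pair) for pair in zip(a, b))

def evolve(generations=30, pop_size=40, elite=8):
    pop = [random_patch() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(*p), reverse=True)
        parents = pop[:elite]  # the "ants" that confused the guard the most survive
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - elite)]
        pop = parents + children  # next generation
    return max(pop, key=lambda p: fitness(*p))

best = evolve()
```

Because the best candidates are always kept (elitism), the search never gets worse between generations; it only needs the fitness score, never gradients.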

2. The "Dual-Personality" Color (The Magic Trick)

This is the coolest part. The sticker needs to look different to the two eyes, but it's only one physical piece of paper.

  • To the Normal Eye: The sticker is bright, high-contrast, and colorful. It screams "Look at me!" to mess up the texture recognition.
  • To the Night-Vision Eye: The computer renders those same colors as a dimmed grayscale (black-and-white) version, so instead of a glowing blob the patch reads as a faint, unremarkable shape.
  • The Analogy: Imagine a sticker that looks like a neon sign to a human, but to a thermal camera, it looks like a subtle shadow that blends perfectly into the background. The computer figures out exactly which colors create this "double vision" effect.
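The "one patch, two appearances" trick can be sketched in a few lines. This is an illustrative assumption, not the paper's exact rendering model: the visible patch is a grid of loud RGB colors, and its simulated infrared appearance is a dimmed grayscale version (using the standard ITU-R BT.601 luminance weights; the 0.4 dimming factor is an arbitrary choice for illustration).

```python
# Sketch: the same RGB patch seen two ways. Visible camera: loud colors.
# Simulated infrared view: dimmed grayscale. The luminance weights are the
# standard BT.601 coefficients; the dim factor is an illustrative choice.

def to_infrared_view(rgb_patch, dim=0.4):
    """Map each (r, g, b) pixel in [0, 1] to a single dimmed intensity."""
    return [[dim * (0.299 * r + 0.587 * g + 0.114 * b)
             for (r, g, b) in row]
            for row in rgb_patch]

visible = [[(1.0, 0.1, 0.9), (0.0, 1.0, 0.2)]]  # high-contrast "neon" colors
infrared = to_infrared_view(visible)             # faint, low-intensity shadow
```

The optimizer's job is then to pick visible colors whose grayscale projection also lands where the attacker wants it in the infrared image, so one physical sticker fools both "eyes" at once.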

3. The "Black Box" Rule

The researchers didn't need to know how the security guard's brain worked inside. They didn't need the blueprints. They just threw stickers at the guard, saw if the guard got confused, and adjusted the stickers based on the result. This makes the attack very hard to stop because you don't need inside information to pull it off.
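The query-only loop above can be sketched as simple black-box hill climbing. Everything here is a hypothetical stand-in: `guard_error` plays the victim model, and the attacker only ever sees its output score, never its internals or gradients.

```python
import random

# Sketch of black-box, query-only optimisation (not the paper's method).
# We never look inside `guard_error` (a hypothetical stand-in for the
# victim model); we only observe how wrong its output is and keep whichever
# sticker made it worse.

def guard_error(sticker):
    # Hypothetical victim: returns how badly the model misbehaves for a
    # given sticker parameter vector (higher = more confused).
    target = [0.2, 0.7, 0.5]
    return -sum((s - t) ** 2 for s, t in zip(sticker, target))

def black_box_attack(queries=500):
    best = [random.random() for _ in range(3)]
    best_err = guard_error(best)
    for _ in range(queries):
        cand = [min(max(v + random.gauss(0, 0.1), 0.0), 1.0) for v in best]
        err = guard_error(cand)   # only the output is observed
        if err > best_err:        # keep it if the guard got more confused
            best, best_err = cand, err
    return best, best_err
```

Because no blueprints are needed, defenses that merely hide the model's weights do not stop this kind of attack; the attacker just needs to watch the outputs.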

Why Does This Matter?

The researchers tested this on three real-world jobs:

  1. Crowd Counting: Making the system think there are 100 people when there are only 10.
  2. Semantic Segmentation: Making the system think a person is a tree or a car.
  3. Image Fusion: Blending two pictures together poorly so the final image is useless.

The Results:

  • Their "Chameleon Sticker" worked much better than old methods.
  • It worked on different types of security guards (different AI models).
  • It was hard to spot (stealthy).
  • Even when the image was blurred or compressed (standard input-transformation defenses), the sticker still worked.
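The blur-resistance claim can be made concrete with a toy check: a defender might blur the captured image before feeding it to the model, so a robust patch must survive that transform. Below is a simple 3x3 box blur on a grayscale image (values in [0, 1]); this is an illustrative defense stand-in, not the specific filters tested in the paper.

```python
# Sketch of an input-transformation defense: a 3x3 mean (box) blur on a
# grayscale image, with edge pixels averaging only their in-bounds
# neighbours. A robust patch would be evaluated on blurred copies too.

def box_blur(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            out[y][x] = sum(vals) / len(vals)
    return out
```

Evaluating candidate stickers against blurred (and compressed) copies of the scene during the search is a common way to make a patch survive exactly these defenses.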

The Takeaway

This paper is a wake-up call. It shows that as we build smarter systems that combine different types of sensors (like cameras and heat sensors), we need to be careful. A single, cleverly designed physical object can trick these advanced systems just as easily as it tricks a human, perhaps even more so.

The authors aren't trying to break the world; they are holding up a mirror to show us where the cracks are, so we can build stronger, safer AI guards for the future.