Imagine you are trying to solve a puzzle, but someone has cut out random, jagged pieces of the picture and replaced them with blank white paper. You still need to see the whole image to understand what it is, or to fill in the missing parts.
This is exactly the problem computer vision systems face when dealing with real-world data. Sensors (like those on self-driving cars) often have "blind spots," or images might have parts blocked out for privacy.
This paper introduces a new way for AI to handle these "blank spots" without getting confused. Here is the breakdown using simple analogies:
The Problem: The "Blind" AI
Most modern AI models (like the popular Mamba architecture) are like a very fast, efficient reader who reads a book page by page.
- The Issue: If the reader encounters a blank page or a page with gibberish (the "invalid data"), they try to read it anyway. They treat the blank space as if it contains important information.
- The Result: The reader gets confused, their understanding of the story gets corrupted, and they make mistakes. In the past, this was solved for older AI models (CNNs) by telling them, "Ignore the blank spots and only read the words that are there." But the new, faster Mamba models didn't have this "ignore" button built-in.
The Solution: Partial Vision Mamba (PVM)
The authors created a new tool called Partial Vision Mamba (PVM). Think of PVM as giving the AI a pair of smart glasses and a special highlighter.
Here is how it works, step-by-step:
1. The "Smart Glasses" (The Mask)
Before the AI even looks at the image, it puts on glasses that show a red "X" over every blank or broken spot and a green checkmark over every valid spot. This is called a Mask. The AI now knows exactly where the data is missing.
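In code, such a mask is nothing more than a binary array the same shape as the image. Here is a minimal sketch (the names and the use of NaN for missing pixels are illustrative, not the paper's actual API):

```python
import numpy as np

# A tiny 4x4 grayscale "image" where two pixels are missing (NaN).
image = np.array([
    [0.2, 0.5, np.nan, 0.9],
    [0.1, np.nan, 0.7, 0.8],
    [0.3, 0.4, 0.6, 0.2],
    [0.9, 0.1, 0.5, 0.7],
])

# The mask: 1 (green checkmark) where data is valid,
# 0 (red "X") where it is missing.
mask = (~np.isnan(image)).astype(np.float32)

print(mask)
```

The model then receives both the image and the mask, so at every step it knows exactly which pixels to trust.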
2. The "Partial Patch" (The Puzzle Piece)
The AI breaks the image into small square tiles (patches) to process them.
- Old Way: If a tile had even one pixel of "blank paper," the AI would treat the entire tile as garbage and throw it away, or worse, try to guess what was there and get it wrong.
- PVM Way: The AI looks at the tile. If it has any valid pixels, it says, "Okay, this tile is useful!" It uses a special trick (called Partial Linear Projection) to count only the valid pixels and rescale the result, so the blank spots contribute nothing to the math. It effectively says, "I'll only listen to the voices I can hear in this room, and ignore the silence."
3. The "Secret Code" (Learned Tokens)
What happens if a whole tile is just blank paper? The AI can't just skip it, or it loses its place in the story.
- The Fix: PVM replaces the blank tile with a special "placeholder token." Think of this like a librarian putting a specific "Out of Order" sign on a broken book. The AI learns that this sign means "Ignore this, but keep the flow going." It doesn't let the broken part contaminate the rest of the story.
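The "Out of Order" sign can be sketched as a single vector that stands in for any fully blank tile. In a real model this vector would be a trainable parameter learned alongside everything else; here it is just a fixed array, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
EMBED_DIM = 4
PATCH_PIXELS = 16

# One learned vector that stands in for fully blank tiles
# (here just a fixed array for illustration).
placeholder_token = rng.random(EMBED_DIM)
weight = rng.random((EMBED_DIM, PATCH_PIXELS))

def embed_patch(patch, mask, weight, placeholder):
    """Embed one tile: partial projection if any pixel is valid,
    the learned placeholder if the whole tile is blank."""
    valid = mask.sum()
    if valid == 0:
        return placeholder  # the "Out of Order" sign
    return weight @ (patch * mask) * (mask.size / valid)

# Three tiles: fully valid, half valid, fully blank.
patches = [rng.random(PATCH_PIXELS) for _ in range(3)]
masks = [np.ones(PATCH_PIXELS), np.ones(PATCH_PIXELS), np.zeros(PATCH_PIXELS)]
masks[1][:8] = 0

tokens = [embed_patch(p, m, weight, placeholder_token)
          for p, m in zip(patches, masks)]
print(len(tokens))
```

Every tile produces a token, so the sequence keeps its full length and the model never "loses its place in the story."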
Why is this a big deal?
The authors tested this "smart glasses" system on three very different jobs:
Depth Completion (The 3D Map): Imagine a self-driving car trying to build a 3D map of the road, but its laser scanner is missing data in the middle of the road.
- Without PVM: The car mistakes the missing data for a flat road or a wall, which could lead to a crash.
- With PVM: The car ignores the missing spots and builds an accurate map using only the data it actually has. The paper showed this improved accuracy by 23%.
Image Inpainting (The Art Restorer): Imagine a famous painting with a hole in the middle. You want the AI to paint over the hole to match the rest of the picture.
- Without PVM: The AI gets confused by the hole and paints a blurry mess or weird lines.
- With PVM: The AI focuses only on the valid parts of the painting to guess what belongs in the hole, creating a much more realistic result.
Image Classification (The Security Guard): Imagine a security camera trying to identify a person, but their face is covered by a large sticker.
- Without PVM: The AI sees the sticker and thinks, "I don't know what this is," or guesses wrong.
- With PVM: The AI looks at the visible parts (the shoulders, the shirt, the hair) and correctly identifies the person, ignoring the sticker entirely. This improved accuracy by 36%.
The Bottom Line
This paper is like inventing a new rule for a game: "If a piece of the board is missing, don't try to guess what's under it; just play with the pieces you have."
By teaching the new, fast AI models (Mamba) how to ignore broken data instead of trying to force it to make sense, the authors have made these models much more robust, accurate, and ready for the messy, imperfect real world.