Imagine you are trying to teach a robot how to understand 3D objects (like chairs, tables, or cars) just by looking at a cloud of dots (a "point cloud") that represents them.
The problem is, real life is messy. Sometimes the robot sees a chair from the front, sometimes from the side, sometimes it's missing a leg, and sometimes the data is noisy (like static on an old TV). Most AI models are like students who only studied for one specific test in a quiet library. When you put them in a noisy, chaotic classroom (a new domain), they panic and fail.
This paper introduces a new system called SADG (Structure-Aware Domain Generalization) that helps the robot stay calm and understand the shape of things, no matter how messy the data gets. Here is how it works, using some simple analogies:
1. The Problem: The "Random Line" vs. The "Smart Map"
To understand a 3D object, the AI has to read the dots in a specific order, like reading a book.
- Old Methods (Transformers): These compare every dot with every other dot at once, like a reader who cross-references every page against every other page in the book. It works, but it gets slow and expensive as the point cloud grows.
- Newer Methods (Mamba): These are like reading a book one page at a time, which is fast. But, most Mamba models read the pages based on their physical coordinates (e.g., "read the dot at x=1, then x=2").
- The Flaw: If you rotate the chair, the "x=1" dot might suddenly be on the other side of the room! The AI gets confused because the order of the story changed just because the object moved. It's like trying to read a story where the sentences jump around every time you turn the book.
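The flaw is easy to demonstrate. Here is a minimal NumPy sketch (the data and rotation angle are made up for illustration): sort a random point cloud by its x coordinate, rotate the cloud, sort again, and the reading order scrambles.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(32, 3))  # a toy "point cloud"

# Rotate the object 60 degrees around the z-axis.
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

order_before = np.argsort(points[:, 0])          # "read smallest x first"
order_after = np.argsort((points @ R.T)[:, 0])   # same object, rotated

# The two reading orders disagree almost everywhere.
print(np.array_equal(order_before, order_after))
```

Same dots, same chair, completely different story order.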
2. The Solution: "Structure-Aware Serialization" (SAS)
The authors realized the AI needs a map that doesn't change when you rotate the object. They invented two new ways to order the dots:
- The "Centroid Compass" (CDS): Imagine the object has a center of gravity (the centroid). Instead of reading left-to-right, the AI starts at the center and spirals outward, like a spider walking from the middle of its web to the edges. No matter how you spin the chair, the spider always starts in the middle and walks out. This keeps the "topology" (the big picture structure) intact.
- The "Curvature Compass" (GCS): Imagine the object is a piece of clay. Some parts are flat, and some are bumpy. The AI measures how "curvy" each part is. It reads the flat parts first, then the bumpy parts. This is like reading a story by emotional intensity rather than by page number. Even if the object is noisy or missing pieces, the "bumpiness" of the surface stays the same.
The Result: The AI now has a "Smart Map" of the object. It reads the dots in an order that reflects the object's intrinsic shape, not its arbitrary coordinate frame.
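The two orderings above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's exact formulation: the function names and the local-PCA curvature proxy are assumptions for the sketch.

```python
import numpy as np

def centroid_distance_order(points):
    """CDS sketch: read dots from the centroid outward (the 'spider
    from the middle of its web'). Distances to the centroid don't
    change under rotation, so neither does this order."""
    dists = np.linalg.norm(points - points.mean(axis=0), axis=1)
    return np.argsort(dists)

def curvature_order(points, k=8):
    """GCS sketch: read flat regions first, bumpy regions last.
    The 'bumpiness' score here is surface variation: the smallest
    PCA eigenvalue of each point's k-nearest-neighbour patch,
    divided by the eigenvalue sum."""
    scores = np.empty(len(points))
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        patch = points[np.argsort(d)[:k]]
        eig = np.sort(np.linalg.eigvalsh(np.cov(patch.T)))
        scores[i] = eig[0] / max(eig.sum(), 1e-12)
    return np.argsort(scores)

# The same cloud, rotated, yields the same centroid-distance order.
rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
rotated = pts @ R.T
```

Spinning the chair moves every coordinate, but each dot keeps its distance from the center and its local bumpiness, so both reading orders survive the rotation.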
3. The "Group Study" (Hierarchical Domain-Aware Modeling)
The AI needs to learn from many different types of data (synthetic computer graphics, real laser scans, etc.).
- The Old Way: Throwing all the data into one big pile and hoping the AI figures it out. This causes confusion.
- The SADG Way: The AI does a "Group Study" in two steps:
- Intra-domain: It studies each group separately first (e.g., "Let's master the computer graphics data").
- Inter-domain: Then, it mixes them up, but carefully. It interleaves the data like shuffling two decks of cards together so that the AI can see the similarities between a computer-generated chair and a real-life chair side-by-side. This helps it learn the universal rules of what a chair looks like, regardless of the source.
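The inter-domain "card shuffle" boils down to something very simple. A hypothetical sketch (the sample names are invented for illustration):

```python
def interleave(domain_a, domain_b):
    """Alternate samples from two domains so each training step sees
    them side by side, like riffling two decks of cards together."""
    mixed = []
    for a, b in zip(domain_a, domain_b):
        mixed.append(a)
        mixed.append(b)
    return mixed

synthetic = ["cad_chair_1", "cad_chair_2", "cad_table_1"]
real_scan = ["scan_chair_1", "scan_chair_2", "scan_table_1"]
print(interleave(synthetic, real_scan))
# ['cad_chair_1', 'scan_chair_1', 'cad_chair_2', 'scan_chair_2',
#  'cad_table_1', 'scan_table_1']
```

The intra-domain step would train on `synthetic` and `real_scan` separately first; only then does the interleaved stream force the model to notice what a CAD chair and a scanned chair have in common.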
4. The "Magic Tuner" (Spectral Graph Alignment)
When the AI faces a brand new object it has never seen before (the "Test Time"), it can't retrain itself. It needs a quick fix.
- The Analogy: Imagine you are playing a guitar, but the room is very echoey (the new domain). You can't rebuild the guitar, but you can adjust the tuning pegs slightly to make the sound clear.
- How it works: The AI looks at the "vibrations" (spectral graph) of the new object. It gently shifts the new object's features to match the "vibrations" of the objects it already knows. It's like a translator who instantly adjusts their accent to match the listener, without needing to learn a new language. This happens in a split second without changing the AI's brain.
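A heavily simplified sketch of the two ingredients: the "vibrations" (eigenvalues of a graph Laplacian built over the features) and a feature shift toward the known domain's statistics. Both functions are stand-ins I've invented to illustrate the idea; the paper's actual alignment is more involved.

```python
import numpy as np

def laplacian_spectrum(feats, k=3):
    """'Vibrations' of a feature set: eigenvalues of the Laplacian of
    an unweighted k-nearest-neighbour graph over the features."""
    n = len(feats)
    d2 = ((feats[:, None] - feats[None, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:  # skip self at index 0
            W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(axis=1)) - W
    return np.sort(np.linalg.eigvalsh(L))

def align_features(target, source_mean, source_std):
    """Test-time 'tuning peg': standardize the new object's features,
    then re-express them in the known domain's statistics. No model
    weights are updated."""
    mu, sd = target.mean(axis=0), target.std(axis=0) + 1e-8
    return (target - mu) / sd * source_std + source_mean

rng = np.random.default_rng(2)
source = rng.normal(loc=0.0, scale=1.0, size=(40, 8))   # familiar domain
target = rng.normal(loc=3.0, scale=0.5, size=(40, 8))   # unseen domain
aligned = align_features(target, source.mean(0), source.std(0))
```

After the shift, the new object's feature statistics match the familiar domain's exactly, and its Laplacian spectrum can be compared against the source's on equal footing; all of this happens at inference time.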
5. The New Playground (MP3DObject)
To prove this works, the authors built a new dataset called MP3DObject.
- The Analogy: Most training datasets are like a clean, well-lit toy store. The new dataset is like a messy, real-world living room with furniture in weird angles, missing parts, and shadows. It's a much harder test, and the new AI passed it with flying colors.
Summary
In short, this paper teaches a fast AI (Mamba) how to understand 3D shapes by:
- Ordering the dots based on their shape and curves, not just their coordinates (so rotation doesn't confuse it).
- Studying different data types together in a smart, structured way.
- Quickly tuning itself to new environments without needing to relearn everything.
It's like giving the robot a pair of glasses that lets it see the true structure of an object, even when the object is broken, rotated, or covered in noise.