FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

FEAT is a linear-complexity foundation model designed for extremely large structured data. It overcomes the scalability and representation limits of existing approaches with a novel multi-layer dual-axis architecture that combines adaptive-fusion bi-Mamba-2 and convolutional gated linear attention, achieving superior zero-shot performance and up to 40x faster inference across diverse real-world datasets.

Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

Published 2026-03-18

Imagine you are a detective trying to solve a massive mystery. You have a room full of millions of suspects (data points), and each suspect has a file with hundreds of clues (features) about them. Your goal is to predict who committed the crime or what they will do next, just by looking at the files of a few known suspects and comparing them to the unknown ones.

This is the world of Structured Data (like spreadsheets, medical records, or financial logs). For a long time, the best detectives (AI models) had a major problem: they were too slow and memory-hungry to look at all the suspects at once.

Here is the story of FEAT, the new detective that solves this problem.

The Problem: The "All-Hands Meeting" Bottleneck

Imagine the old way of doing this. To understand a new suspect, the detective had to call every single other suspect into a room and ask them to compare notes with the new person.

  • The Math: If you have 100 suspects, that's 10,000 comparisons. If you have 100,000 suspects, that's 10 billion comparisons.
  • The Result: The computer runs out of memory (the room gets too crowded) or takes days to finish the meeting. This is called Quadratic Complexity, O(N²). It's like trying to shake hands with everyone in a stadium; it gets impossible as the crowd grows.
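The arithmetic in the bullets above can be sanity-checked in a few lines. This is just counting, not the model itself; the function names are ours:

```python
# Why pairwise comparison blows up quadratically while a
# single-pass scan grows in step with the data.

def pairwise_comparisons(n: int) -> int:
    # Every suspect compares notes with every other suspect: n * n.
    return n * n

def linear_passes(n: int) -> int:
    # One pass down the line: each suspect is visited exactly once.
    return n

for n in (100, 100_000):
    print(f"{n:>7} suspects: {pairwise_comparisons(n):>14,} pairwise vs {linear_passes(n):>7,} linear")
```

At 100,000 rows, the quadratic detective does 10 billion comparisons while the linear one does 100,000 — that gap is the whole motivation for FEAT.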

The Solution: FEAT (The Efficient Detective)

The authors created FEAT, a new kind of foundation model. Think of FEAT as a detective who doesn't need a giant meeting room. Instead, FEAT uses a smart, two-step filing system that works linearly (one step at a time, no matter how big the crowd is).

Here is how FEAT works, using simple analogies:

1. The "Dual-Axis" Strategy (The Two-Step Dance)

Most AI models try to do everything at once. FEAT splits the job into two distinct dances:

  • Step A: The Feature Dance (Looking at the Clues)
    First, FEAT looks at a single suspect's file. It checks how the clues relate to each other (e.g., "If this person is old, they are likely retired"). It does this for every file independently. This is fast and doesn't require comparing files yet.

    • Analogy: Reading a single book to understand its internal plot.
  • Step B: The Sample Dance (Looking at the Crowd)
    Next, FEAT looks at how the files relate to each other. But here is the magic: instead of making everyone talk to everyone, FEAT uses two special tools:

    • Tool 1: The "Bi-Mamba" (The Local Gossip): This tool walks down the line of suspects, listening to the immediate neighbors. It remembers the local trends (e.g., "The last 5 people all had blue eyes"). It's fast and remembers the recent past.
    • Tool 2: The "Conv-GLA" (The Global Librarian): The "Gossip" tool has a short memory. If the line is too long, it forgets the beginning. So, FEAT adds a "Librarian" who keeps a summary book of the entire crowd. This librarian doesn't read every page; they just update a running summary (a "covariance memory") of what the whole group looks like.
    • Analogy: The Gossip tells you what's happening right now, and the Librarian tells you the big picture of the whole room.

Why this is a game-changer: By combining a fast local listener with a summary-keeping librarian, FEAT can process millions of rows without the computer crashing. It scales linearly, O(N), meaning if you double the data, it only takes double the time, not quadruple.
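The "Librarian" idea — a fixed-size running summary instead of an all-pairs meeting — is the core trick behind linear attention. Here is a minimal sketch of that idea (illustrative names, not the paper's code): the "covariance memory" S stays the same size no matter how many rows stream past.

```python
import numpy as np

def librarian_summary(keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Maintain a fixed-size running summary ("covariance memory").

    The summary S is d_k x d_v regardless of how many rows we see,
    which is why memory does not grow with the crowd.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))
    for k, v in zip(keys, values):
        S += np.outer(k, v)  # the librarian updates the summary book
    return S

rng = np.random.default_rng(0)
N, d = 1000, 8                       # 1000 rows, 8-dim features
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
S = librarian_summary(K, V)
print(S.shape)  # (8, 8) — the summary never grows with N
```

A query can then read the whole crowd's statistics from S in one matrix product, instead of comparing itself against all N rows individually.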

2. Solving the "Permutation" Puzzle

In a spreadsheet, the order of rows doesn't matter. Suspect #100 is the same as Suspect #1 if you swap their places.

  • The Old Problem: Many fast AI models (like Mamba) were designed for text, where order matters (Sentence 1 comes before Sentence 2). If you feed them a spreadsheet, they get confused and think the order matters, leading to bad guesses.
  • FEAT's Fix: FEAT uses a special "Identity Card" for every column. It treats every feature (like "Age" or "Income") as a unique character, regardless of where it sits in the file. This ensures the model understands that the data is a bag of clues, not a story with a beginning and end.
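The "Identity Card" point can be demonstrated with a toy encoder. This is a sketch of the general principle (per-column identity vectors plus order-agnostic pooling), not FEAT's actual embedding code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, d = 5, 3, 4
X = rng.normal(size=(n_rows, n_cols))      # a tiny spreadsheet
col_id = rng.normal(size=(n_cols, d))      # one "identity card" per column

def encode(table: np.ndarray) -> np.ndarray:
    # Each cell value is tagged with its column's identity vector,
    # then rows are summed — a set operation, not a sequence read.
    return (table[..., None] * col_id).sum(axis=0)

perm = rng.permutation(n_rows)
same = np.allclose(encode(X), encode(X[perm]))
print(same)  # True — shuffling the rows cannot change the encoding
```

Because the aggregation is a sum over rows, Suspect #100 and Suspect #1 can trade places freely: the model sees a bag of clues, not a story.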

3. The "Heavy-Tail" Training (Handling the Weirdos)

Real-world data is messy. Most people have average salaries, but a few billionaires skew the average. This is called a heavy-tailed distribution.

  • The Old Problem: If an AI tries to learn from these "billionaire" outliers, it gets confused, panics, and its math breaks (gradient explosion).
  • FEAT's Fix: FEAT was trained on a special mix of synthetic data (fake data made by a smart generator) and real data. The generator was taught to create "billionaires" and "outliers" on purpose. FEAT also uses a "tough love" math rule (Huber Loss) that ignores extreme outliers instead of freaking out about them. This makes FEAT robust enough to handle the messy reality of the real world.
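The "tough love" rule is the standard Huber loss: quadratic for small errors, linear for large ones, so a billionaire-sized outlier contributes a bounded penalty instead of a squared explosion. A minimal version:

```python
import numpy as np

def huber(residual: np.ndarray, delta: float = 1.0) -> np.ndarray:
    # Quadratic near zero, linear beyond delta -> extreme outliers
    # get a bounded gradient instead of blowing up the training math.
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

errors = np.array([0.5, 1.0, 100.0])  # the last one is a "billionaire" outlier
print(huber(errors))       # outlier costs 99.5
print(0.5 * errors**2)     # under squared loss it would cost 5000.0
```

With squared loss, the single outlier dominates everything; with Huber, it is noticed but not allowed to panic the model.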

The Results: Speed and Smarts

The paper tested FEAT on 11 different real-world datasets (healthcare, finance, etc.).

  • Speed: When the dataset grew to 500,000 rows, old models crashed or took 22 seconds. FEAT handled it in 0.5 seconds. That's a 40x speedup.
  • Accuracy: Despite being faster and simpler, FEAT was just as smart as the slow, heavy models. It could predict outcomes with high accuracy without needing to be retrained for each new task (Zero-Shot Learning).

Summary

FEAT is like upgrading from a detective who needs a massive conference room to hold a meeting with everyone, to a detective who carries a smart notebook.

  1. The notebook has a local gossip section for immediate context.
  2. It has a global librarian section for the big picture.
  3. It knows that order doesn't matter in a spreadsheet.
  4. It isn't scared by weird outliers.

This allows us to analyze massive amounts of structured data (like millions of patient records or financial transactions) instantly, opening the door for AI to make better decisions in healthcare, finance, and science without waiting days for the computer to finish its calculations.
