Imagine you are trying to organize a massive library with 100 million books.
In the world of Artificial Intelligence (AI), these "books" are pieces of text (tokens), and the "organizer" is a mechanism called Attention. Its job is to figure out which words in a sentence relate to each other. For example, in the sentence "The cat sat on the mat," the word "sat" needs to know about "cat" and "mat."
The Problem: The "Handshake" Bottleneck
The current standard method (called Softmax Attention) works like a giant mixer. To understand one word, it has to shake hands with every single other word in the book to see how relevant they are.
- The Math: If you have 1,000 words, the computer does 1,000,000 handshakes (1,000 × 1,000).
- The Disaster: If you have 1 million words, the computer has to do 1 trillion handshakes (1,000,000 × 1,000,000).
This is why current AI models crash or take forever when you try to feed them a whole novel or a long video transcript. Even the fastest supercomputers (like the NVIDIA GH200) hit a wall at around 4 million tokens. It's like trying to introduce every person in a stadium to every other person individually before the concert starts.
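The quadratic blow-up described above is easy to see in a few lines of Python (a toy count, not a benchmark):

```python
def handshakes(n):
    # Softmax attention compares every token with every other token: n * n pairs.
    return n * n

print(f"{handshakes(1_000):,}")      # 1,000,000
print(f"{handshakes(1_000_000):,}")  # 1,000,000,000,000 (one trillion)
```

Multiply the context length by 1,000 and the work grows by 1,000,000 — which is exactly why long contexts hit a wall.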
The Solution: RACE Attention
The paper introduces RACE Attention (Repeated Arrays-of-Count Estimators). Instead of making everyone shake hands, RACE uses a clever sorting hat system.
The Analogy: The Library Sorting Hat
Imagine you are the librarian. Instead of asking every book to talk to every other book, you use a magical sorting hat (the Hashing part of RACE).
- The Buckets: You have a set of buckets (let's say 100 of them).
- The Sorting: When a book (a word) arrives, the hat quickly decides which bucket it belongs to based on its "vibe" (its meaning). Similar books get thrown into the same bucket.
- The Summary: Instead of reading every book in the bucket, you just look at the summary of that bucket. "Bucket A has 50 books about cats."
- The Connection: When you need to find information for a specific word, you don't check the whole library. You just check the summaries of the buckets that word was sorted into.
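The bucket idea above can be sketched in a few lines. This is a minimal illustration using random-hyperplane (SimHash-style) hashing, not the paper's actual implementation; all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                              # embedding size (illustrative)
planes = rng.normal(size=(4, d))   # 4 random hyperplanes -> 2**4 = 16 buckets
n_buckets = 16

def bucket_of(vec):
    # The sign pattern against the hyperplanes gives a bucket id:
    # similar vectors tend to land on the same side of each plane.
    bits = (planes @ vec > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# The "books" (token vectors).
tokens = rng.normal(size=(1_000, d))
summaries = np.zeros((n_buckets, d))  # running sum per bucket
counts = np.zeros(n_buckets)

for t in tokens:
    b = bucket_of(t)
    summaries[b] += t
    counts[b] += 1

# A query reads only the summary of its own bucket, never all 1,000 tokens.
q = tokens[0]
b = bucket_of(q)
estimate = summaries[b] / max(counts[b], 1)  # average of the similar tokens
```

Each query touches one bucket summary instead of the whole library — that fixed amount of work per word is where the speedup comes from.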
Why is this faster?
- Old Way (Softmax): Check 1,000,000 books individually. (Quadratic time: O(n²).)
- RACE Way: Check 100 bucket summaries. (Linear time: O(n).)
As the library grows from 1,000 books to 100 million, the old way gets impossibly slow, but the RACE way stays fast because you only ever check a fixed number of buckets.
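The same comparison as rough operation counts (assuming a fixed bucket count of 100, as in the analogy above; these are illustrative tallies, not the paper's measurements):

```python
B = 100  # fixed number of buckets

def softmax_work(n):
    # Every word checks every other word: n * n comparisons.
    return n * n

def race_work(n):
    # Every word checks a fixed number of bucket summaries: n * B.
    return n * B

for n in [1_000, 100_000_000]:
    print(f"n={n:,}: softmax={softmax_work(n):,}  buckets={race_work(n):,}")
```

At 100 million tokens the bucket scheme does a millionth of the work of the all-pairs check, and the gap keeps widening as the context grows.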
How It Works (The "Magic" Tricks)
The paper uses two main tricks to make this work without losing accuracy:
Soft Hashing (The "Fuzzy" Bucket):
In the past, sorting methods were "hard"—a book was either in Bucket A or Bucket B. If it was 99% similar to Bucket A but 1% to Bucket B, it got forced into A, losing nuance.
RACE uses Soft Hashing. It's like saying, "This book is 90% in Bucket A and 10% in Bucket B." This allows the AI to learn and adjust smoothly, keeping the math accurate even though it's skipping the full library check.
Sharpening the Lens:
The paper uses a special mathematical "lens" (an angular kernel) that makes the AI very good at spotting the most relevant books. It's like using a magnifying glass that makes the most important words glow bright white and the irrelevant ones fade to gray, so the bucket summaries are very precise.
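The two tricks combine naturally: each word gets soft membership weights over the buckets, computed from angular (cosine) similarity and then sharpened so the best match dominates. Here is a minimal NumPy sketch under those assumptions — `bucket_dirs` and `sharpness` are illustrative names, not the paper's API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_buckets = 8, 4  # embedding size and bucket count (illustrative)

# "Bucket directions" in embedding space (random here; learned in practice).
bucket_dirs = rng.normal(size=(n_buckets, d))
bucket_dirs /= np.linalg.norm(bucket_dirs, axis=1, keepdims=True)

def soft_assign(vec, sharpness=5.0):
    # Angular (cosine) similarity between the word and each bucket direction...
    v = vec / np.linalg.norm(vec)
    sims = bucket_dirs @ v
    # ...sharpened and normalized into soft membership weights, so a word can
    # be "90% in Bucket A and 10% in Bucket B" instead of forced into one.
    w = np.exp(sharpness * sims)
    return w / w.sum()

token = rng.normal(size=d)
weights = soft_assign(token)
print(weights.round(3))  # soft weights over the buckets, summing to 1
```

Because the weights are smooth rather than all-or-nothing, gradients flow through the assignment and the model can train normally, which is the nuance hard hashing loses.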
The Results: Breaking the Limits
The authors tested this on some of the most powerful hardware available (NVIDIA GH200 GPUs and Intel CPUs).
- The Old Way: Crashed or took hours at 4 million tokens.
- RACE Way: Successfully processed 12 million tokens on a GPU and a staggering 75 million tokens on a standard CPU in a single pass.
The "Right Algorithm Beats Hardware" Moment:
The most impressive part of the paper is that RACE running on a standard, slow CPU was actually 40 times faster than the most advanced, expensive GPU running the old method when dealing with huge amounts of text. It proves that a smart algorithm is more powerful than just throwing more money at hardware.
Summary
RACE Attention is a new way for AI to read long documents. Instead of trying to read every word against every other word (which is slow and expensive), it groups words into "buckets" and reads the summaries.
- Old AI: "I need to read the whole encyclopedia to understand this one sentence."
- RACE AI: "I'll check the index, find the relevant chapters, and read the summaries."
This allows AI to finally handle massive contexts—like entire books, long codebases, or hours of video—without running out of memory or time, making long-context AI accessible on regular computers.