Imagine you have a massive library containing millions of books. You need to explain the entire collection to a friend, but you only have time to show them a few pages. How do you choose which pages to show so that your friend gets the exact same feeling and understanding as if they had read the whole library?
This is the problem of Thinning. In the world of data science, "thinning" means taking a huge dataset and picking a tiny, representative handful of points to summarize the whole thing.
For a long time, the best way to do this was like picking books at random. It works okay, but it's inefficient. You might pick 10,000 random pages just to get a decent summary.
This paper introduces a new, smarter way to thin data called Low-Rank Thinning. Here is the breakdown using simple analogies.
1. The Problem: The "Pessimistic" Old Way
Imagine you are trying to describe a complex painting to someone over the phone.
- The Old Method (Uniform Subsampling): You close your eyes and point at random spots on the canvas, describing whatever you see. To get a good description, you have to point at thousands of random spots. It's slow, and you might miss the most important details (like the face in the portrait).
- The Flaw: Previous "smart" methods tried to be better, but they had a major weakness: they assumed the data was messy and high-dimensional (like a painting with infinite colors and textures). Because of this, their math was "pessimistic"—they had to pick way too many points to guarantee accuracy, especially as the data got more complex.
2. The Solution: Finding the "Skeleton" (Low-Rank)
The authors realized that most real-world data isn't actually as messy as we think. It usually has a hidden, simple structure.
- The Analogy: Think of a 3D sculpture. From the outside, it looks complex. But if you look at its "skeleton" (the wireframe inside), it might only have a few main beams holding it up.
- The "Low-Rank" Insight: In math terms, data with this kind of "skeleton" is called low-rank. It means the data can be compressed into a few key directions without losing much information.
- The New Method: Instead of guessing randomly, the new algorithm looks for that hidden skeleton. Once it finds the simple structure, it knows exactly which points are the "skeleton" and which are just "flesh" (redundant details). It can then throw away the flesh and keep only the skeleton.
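To make the "skeleton" idea concrete, here is a toy NumPy sketch (an illustration of the low-rank concept, not the paper's algorithm): a dataset built from only five hidden directions looks big, but its singular values reveal that five numbers' worth of structure is all there is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 1000 x 200 data matrix that secretly has rank 5: every row
# is a mixture of just 5 hidden "skeleton" directions.
skeleton = rng.normal(size=(5, 200))
weights = rng.normal(size=(1000, 5))
data = weights @ skeleton

# The singular values expose the hidden structure: only 5 of the 200
# are non-negligible; the rest of the matrix is redundant "flesh".
singular_values = np.linalg.svd(data, compute_uv=False)
effective_rank = int((singular_values > 1e-8 * singular_values[0]).sum())
print(effective_rank)  # prints 5
```

Even though the matrix has 200,000 entries, its skeleton (5 directions plus 5 weights per row) captures it almost exactly.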
3. How It Works: The "Smart Filter"
The paper proposes a new mathematical filter that works like a high-tech sieve.
- If the data really is messy and complex, the sieve falls back to behaving like the old method, letting through enough points to stay accurate.
- If the data has a simple structure (low-rank), the sieve recognizes it and lets through only a tiny, near-perfect summary.
- The Result: You get a summary that is just as accurate as the old "random" method, but you only need a fraction of the points. It's like summarizing a 500-page novel in just 5 pages without losing the plot.
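Here is a toy sketch of what a "smart" selection buys over random picking (a simple greedy mean-matching heuristic written for this post, not the paper's actual algorithm): 20 carefully chosen points summarize the average of 10,000 points far better than 20 random ones.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(10_000, 3))
full_mean = data.mean(axis=0)

# Greedy "sieve": at each step, keep the point that pulls the summary's
# running average closest to the full dataset's average
# (duplicates are allowed in this toy).
summary_sum = np.zeros(3)
for k in range(1, 21):
    candidate_means = (summary_sum + data) / k
    best = np.argmin(np.linalg.norm(candidate_means - full_mean, axis=1))
    summary_sum += data[best]

greedy_err = np.linalg.norm(summary_sum / 20 - full_mean)
random_pick = data[rng.choice(10_000, 20, replace=False)]
random_err = np.linalg.norm(random_pick.mean(axis=0) - full_mean)
print(greedy_err < random_err)  # the smart summary is far closer
```

The paper's guarantees cover much richer notions of "summary quality" than matching a single mean, but the flavor is the same: choosing points deliberately beats choosing them blindly.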
4. Real-World Superpowers
The authors didn't just do the math; they showed how this "Smart Filter" can fix three huge problems in modern AI:
A. The "Transformer" Bottleneck (Chatbots and Image Generators)
- The Problem: Modern AI (like the models that write this text or generate images) uses something called "Attention." It's like the AI trying to read every single word in a book to understand one sentence. If the book is huge, this takes forever and crashes the computer.
- The Fix: The authors created a tool called Thinformer. It uses their low-rank filter to ignore the boring, repetitive words and only focus on the "skeleton" words that actually matter.
- The Win: They made AI models run much faster (sometimes 2x or 3x faster) while actually getting more accurate results than previous fast methods.
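To see why redundancy makes attention compressible, here is an idealized NumPy sketch (a toy built for this post, not Thinformer itself): when keys and values repeat, attending to one representative per group, with its score adjusted by the log of its group size, reproduces full attention exactly, at a fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 512, 16
Q = rng.normal(size=(n, d))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# 512 tokens, but only 8 distinct key/value pairs, each repeated 64
# times: the "boring, repetitive words" scenario.
reps_K = rng.normal(size=(8, d))
reps_V = rng.normal(size=(8, d))
K = np.repeat(reps_K, 64, axis=0)
V = np.repeat(reps_V, 64, axis=0)

# Exact attention: every query scores all 512 keys (O(n^2) work).
full = softmax(Q @ K.T / np.sqrt(d)) @ V

# "Thinned" attention: score only the 8 representatives, adding
# log(multiplicity) to each score (a constant shift here, but it
# matters when group sizes differ).
approx = softmax(Q @ reps_K.T / np.sqrt(d) + np.log(64)) @ reps_V

print(np.allclose(full, approx))  # prints True
```

Real text is never perfectly repetitive, so the paper's method handles approximate redundancy with provable error bounds; this sketch only shows the best case.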
B. Training AI Faster (The "Running" Analogy)
- The Problem: Teaching an AI is like running a marathon. You usually run in a random order (Random Reshuffling). Sometimes you hit a patch of mud (bad data) that slows you down.
- The Fix: The new method acts like a smart coach. It looks at the terrain (the data gradients), sees where the mud is, and rearranges your running order so you hit the smooth paths first. It uses the "low-rank" structure of the mud to predict the best path.
- The Win: The AI learns the same amount of information in fewer steps, saving massive amounts of time and electricity.
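The reordering idea can be sketched in a toy (a simple greedy heuristic on synthetic gradients, invented for this post, not the paper's method): arrange the examples so the gradients seen early in the epoch already average out to the full epoch's average.

```python
import numpy as np

rng = np.random.default_rng(3)
# Per-example gradients for one epoch: 200 examples, 10 parameters.
grads = rng.normal(size=(200, 10))
epoch_mean = grads.mean(axis=0)

# Greedy reordering: at step k, pick the example that keeps the running
# gradient sum closest to the "ideal" pace k * epoch_mean.
remaining = list(range(200))
order, running = [], np.zeros(10)
for k in range(1, 201):
    ideal = k * epoch_mean
    best = min(remaining, key=lambda i: np.linalg.norm(running + grads[i] - ideal))
    order.append(best)
    remaining.remove(best)
    running += grads[best]

def warmup_error(ordering, k=20):
    # How far the average gradient of the first k steps is from the
    # true full-epoch average.
    return np.linalg.norm(grads[ordering[:k]].mean(axis=0) - epoch_mean)

greedy_err = warmup_error(order)
shuffled_err = warmup_error(list(rng.permutation(200)))
print(greedy_err < shuffled_err)  # the smart order learns the "terrain" sooner
```

In the toy, the first 20 greedily ordered examples already look like the whole epoch, while a random shuffle drifts; the paper achieves this kind of balance efficiently by exploiting low-rank structure in the gradients.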
C. The "Lie Detector" (Testing Distributions)
- The Problem: Imagine you have two jars of marbles. You want to know if they came from the same factory or different ones. To be sure, the traditional approach was to count and measure every single marble in both jars, which takes forever.
- The Fix: The new method uses the "skeleton" idea to pick just a few marbles from each jar that represent the whole.
- The Win: You can tell if the jars are different almost instantly (in "near-linear time") with the same accuracy as counting every single marble. This is huge for detecting fraud or anomalies in massive datasets.
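The marble-jar comparison can be sketched as follows (using plain subsampling and a standard kernel discrepancy score as stand-ins for the paper's low-rank thinning): small summaries from each jar are enough to separate "same factory" from "different factory".

```python
import numpy as np

rng = np.random.default_rng(4)

def mmd2(x, y, bandwidth=1.0):
    # (Biased) squared maximum mean discrepancy with an RBF kernel:
    # a standard score for "how different do these two samples look?"
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

jar_a = rng.normal(0.0, 1.0, size=(200, 2))  # factory A
jar_b = rng.normal(0.0, 1.0, size=(200, 2))  # factory A again
jar_c = rng.normal(1.0, 1.0, size=(200, 2))  # factory B (shifted)

# Thin each jar down to 50 marbles and compare only the summaries.
a, b, c = jar_a[:50], jar_b[:50], jar_c[:50]

same = mmd2(a, b)        # same factory: small discrepancy
different = mmd2(a, c)   # different factories: large discrepancy
print(different > same)  # prints True
```

The paper's contribution is making the thinning step both fast (near-linear time) and provably accurate, so the small summaries are guaranteed not to blur the distinction.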
The Big Picture
The core message of this paper is: Don't treat all data as equally complex.
Most data has a simple "skeleton" hidden inside. By finding that skeleton first, we can throw away 99% of the data without losing any meaning. This makes AI faster, cheaper to run, and more accurate, allowing us to build bigger and better models without needing supercomputers for everything.
In short: They found a way to summarize a library by reading just the table of contents and the first sentence of every chapter, and it turns out that's enough to know the whole story perfectly.