FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

This paper introduces FAST, a DNN-free coreset selection framework that combines spectral graph theory with an Attenuated Phase-Decoupled Characteristic Function Distance inside a progressive sampling strategy, achieving better distribution matching, energy efficiency, and accuracy than existing state-of-the-art methods.

Jin Cui, Boran Zhao, Jiajun Xu, Jiaqi Guo, Shuo Guan, Pengju Ren

Published 2026-03-04

Imagine you are a chef trying to teach a new apprentice how to cook a complex, 100-course banquet. You have a massive library of 10,000 recipe books. If you make the apprentice read all of them, it will take years, cost a fortune in electricity, and they might get overwhelmed.

Coreset Selection is the art of picking just a few "perfect" recipe books that, if studied, will teach the apprentice everything they need to know to cook the whole banquet.

The problem? Most existing methods for picking these books are flawed:

  1. The "Expert Bias" Method: They use a specific, famous chef (a Deep Neural Network) to pick the books. But if that chef hates spicy food, they'll only pick spicy recipes, and the apprentice will fail when asked to cook Italian.
  2. The "Gut Feeling" Method: They use simple rules (like "pick the most colorful books"). This often misses the subtle, complex flavors needed for a great dish.

Enter FAST (Frequency-domain Aligned Sampling via Topology). It's a new, smarter way to pick those recipe books without needing a famous chef to guide it.

Here is how FAST works, broken down into simple concepts:

1. The "Sound Wave" Analogy (Frequency Domain)

Imagine every dataset (like a collection of photos) is a piece of music.

  • Low Frequencies are the bass and drums: the big shapes, the general colors, and the overall vibe (e.g., "This is a picture of a cat").
  • High Frequencies are the cymbals and violins: the tiny details, the whiskers, the texture of the fur, and the sharp edges.

Most old methods only listen to the bass (low frequencies). They pick books that look like cats but might miss the specific breed or the texture of the fur. FAST listens to the entire symphony, from the deep bass to the highest violin note. It uses a mathematical tool called the Characteristic Function to capture every single note and rhythm in the data, ensuring the small subset of books represents the whole library perfectly.
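To make the "listening" concrete, here is a minimal sketch of how a characteristic-function distance could score a subset against the full dataset. The function names and the simple 1-D setup are illustrative, not the paper's actual implementation:

```python
import numpy as np

def empirical_cf(x, freqs):
    """Empirical characteristic function of 1-D samples x at the given
    frequencies: phi(t) = E[exp(i * t * X)], estimated as a sample mean."""
    # Shape (num_freqs, num_samples), then average over the samples.
    return np.exp(1j * np.outer(freqs, x)).mean(axis=1)

def cf_distance(full, subset, freqs):
    """Mean squared gap between the two characteristic functions."""
    gap = empirical_cf(full, freqs) - empirical_cf(subset, freqs)
    return np.mean(np.abs(gap) ** 2)

rng = np.random.default_rng(0)
full = rng.normal(size=10_000)
freqs = np.linspace(-5, 5, 101)

good = full[:1_000]             # a representative subset
bad = full[full > 1.0][:1_000]  # a biased subset (right tail only)

# The representative subset "sounds" much closer to the full library.
assert cf_distance(full, good, freqs) < cf_distance(full, bad, freqs)
```

Because the characteristic function is evaluated across the whole frequency axis, matching it forces the subset to reproduce every "note" of the original distribution, not just its low-frequency outline.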

2. The "Vanishing Phase" Problem

There was a catch with listening to the whole symphony. In the high notes (high frequencies), the volume (amplitude) gets very quiet, almost silent. Because the volume is so low, the old methods ignored the timing (phase) of those notes.

Think of it like a movie with the sound turned down. You can see the actors moving (amplitude), but you can't hear their dialogue (phase). In the high-frequency notes, the "dialogue" contains the most important details (like the texture of a bird's feather or the edge of a building). If you ignore the dialogue, you lose the story.

FAST's Solution: They invented a special "Phase-Decoupled" microphone. It turns up the volume specifically for the timing of the quiet notes, so the system can hear the fine details even when the overall sound is faint. This allows it to capture textures and edges that other methods miss.
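A rough sketch of that "phase-decoupled microphone": split the characteristic-function gap into an amplitude term and an amplitude-independent phase term, so the phase still counts even where the amplitude has decayed toward zero. The specific normalization and equal weighting below are assumptions for illustration, not the paper's exact formula:

```python
import numpy as np

def phase_decoupled_gap(full, subset, freqs, eps=1e-8):
    """Compare two empirical characteristic functions with phase decoupled
    from amplitude (illustrative split; the paper's attenuation-aware
    weighting may differ)."""
    phi_f = np.exp(1j * np.outer(freqs, full)).mean(axis=1)
    phi_s = np.exp(1j * np.outer(freqs, subset)).mean(axis=1)

    amp_gap = (np.abs(phi_f) - np.abs(phi_s)) ** 2
    # Normalizing each CF to (near) unit modulus keeps the phase comparison
    # audible even at high frequencies, where the raw amplitude vanishes.
    phase_gap = np.abs(phi_f / (np.abs(phi_f) + eps)
                       - phi_s / (np.abs(phi_s) + eps)) ** 2
    return amp_gap.mean() + phase_gap.mean()
```

In the plain distance of the previous sketch, a high-frequency phase error is multiplied by a tiny amplitude and disappears; here it survives, which is exactly the "turned-up dialogue" in the movie analogy.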

3. The "Curriculum Learning" Strategy

Imagine trying to learn a language. If you start by memorizing complex, obscure idioms before you even know the alphabet, you will fail. You need to learn in order: first the letters, then words, then sentences.

FAST does the same with data. It uses a strategy called Progressive Discrepancy-Aware Sampling.

  • Step 1: It picks books that match the "bass" (the big, global shapes).
  • Step 2: Once the big picture is right, it starts adding books that match the "mid-range" sounds.
  • Step 3: Finally, it adds the books with the "high notes" (the tiny, specific details).

This prevents the system from getting confused by the tiny details before it understands the big picture. It builds the perfect subset layer by layer.
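The layer-by-layer idea can be sketched as a coarse-to-fine greedy loop: each stage widens the frequency band and grows the subset to better match the characteristic function on that band. The band schedule, random candidate pool, and even budget split are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def ecf(x, freqs):
    """Empirical characteristic function of 1-D samples x."""
    return np.exp(1j * np.outer(freqs, x)).mean(axis=1)

def progressive_select(data, budget, bands, rng):
    """Greedy coarse-to-fine selection: fix the global structure (low
    frequencies) first, then refine with ever higher bands."""
    chosen, remaining = [], list(range(len(data)))
    per_stage = budget // len(bands)
    for band in bands:  # low -> high frequency bands
        freqs = np.linspace(-band, band, 41)
        target = ecf(data, freqs)
        for _ in range(per_stage):
            # Among a small random pool, pick the point that most
            # reduces the CF gap on the current band.
            pool = rng.choice(remaining, size=min(64, len(remaining)),
                              replace=False)
            best = min(pool, key=lambda i: np.mean(np.abs(
                ecf(data[chosen + [i]], freqs) - target) ** 2))
            chosen.append(best)
            remaining.remove(best)
    return np.array(chosen)

rng = np.random.default_rng(1)
data = rng.normal(size=2_000)
idx = progressive_select(data, budget=60, bands=[1.0, 3.0, 6.0], rng=rng)
```

Early stages never see the high-frequency "cymbals," so the subset's big picture is settled before fine detail gets a vote.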

4. The "Map" Constraint (Topology)

When you pick a few books from a library of 10,000, you don't want to pick 10 books that are all about "Cats" and ignore "Dogs." You need a map to ensure you cover the whole territory.

FAST builds a Topological Map of the data. It treats the data points like cities on a map connected by roads. It ensures that the few books it picks are spread out across the map, covering every "neighborhood" of the data, rather than clustering in just one corner. This guarantees that the small subset is a true, representative mini-version of the whole.
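As a simplified stand-in for that coverage constraint, farthest-point sampling spreads picks across the data "map" so no neighborhood is left empty. The paper builds its constraint on a graph over the data; this purely geometric version just illustrates the coverage idea:

```python
import numpy as np

def farthest_point_sample(points, k, rng):
    """Pick k points so each new pick lands in the currently
    least-covered region (a classic coverage heuristic, not the
    paper's topological construction)."""
    chosen = [int(rng.integers(len(points)))]
    # dist[i] = distance from point i to its nearest chosen point so far.
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # the most "ignored" point
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(0)
clusters = np.vstack([rng.normal(size=(100, 2)),          # "cat" neighborhood
                      rng.normal(size=(100, 2)) + 20.0])  # "dog" neighborhood
picks = farthest_point_sample(clusters, k=2, rng=rng)
```

With two well-separated clusters, even two picks land one in each: the coverage rule forbids spending both on "Cats."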

Why Does This Matter?

  • It's Faster and Cheaper: Because FAST doesn't need a giant AI chef to do the picking, it runs on a standard computer chip (CPU) and uses almost no electricity. It's 96% more energy-efficient than current methods.
  • It's Smarter: It works on any type of data, from simple photos to complex textures and even language models. It doesn't care what kind of "chef" (neural network) you eventually use to train on the data.
  • It's Accurate: By listening to the whole "symphony" of the data and learning in the right order, the small subset it picks teaches the AI just as well as the massive original dataset.

In short: FAST is like a master librarian who doesn't need a computer to pick the best books. Instead, they listen to the "music" of the entire library, learn the songs in the right order, and pick a tiny, perfect playlist that captures the soul of the whole collection.
