A Comparative Study of UMAP and Other Dimensionality Reduction Methods

This paper presents a comprehensive comparative analysis of UMAP against other dimensionality reduction methods, revealing that while supervised UMAP excels in classification tasks, it currently struggles to effectively incorporate response information for regression applications.

Guanzhe Zhang, Shanshan Ding, Zhezhen Jin

Published 2026-03-04

Imagine you have a massive library filled with millions of books. Each book is described by thousands of details: the author's name, the font size, the color of the cover, the number of pages, the smell of the ink, the temperature of the room it was written in, and so on. This is what data scientists call high-dimensional data. It's too much to hold in your head at once, and it's hard to find patterns.

Dimensionality Reduction is like hiring a super-smart librarian who can summarize all those thousands of details into just a few key tags (like "Mystery," "Romance," or "Sci-Fi") so you can easily organize and find what you need.

This paper is a "taste test" comparing different librarians (algorithms) to see who does the best job. The main star of the show is a new, trendy librarian called UMAP (Uniform Manifold Approximation and Projection). The authors wanted to see if UMAP is as good at organizing books when you give it a specific goal (like "find me all the mystery novels") compared to when it just organizes them on its own.

Here is the breakdown of the study using simple analogies:

1. The Contestants (The Methods)

The paper pits UMAP against some old-school and some other new librarians:

  • PCA (Principal Component Analysis): The "Old School" librarian. It's fast and simple, but it only looks for the biggest, most obvious differences. It's like saying, "All books with red covers go here." It misses subtle, complex patterns.
  • t-SNE: The "Artistic" librarian. It's amazing at grouping books that look very similar (local patterns) but sometimes loses the big picture of how the whole library is arranged (global structure). It's also slow and can't easily organize new books that arrive later.
  • SIR (Sliced Inverse Regression): The "Targeted" librarian. It's a specialist that only works if you tell it what you are looking for (the "response"). It's great at finding the specific path to the answer.
  • UMAP (The Star): The "Modern" librarian. It's famous for being fast and great at keeping both the small details and the big picture intact.
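In code, these librarians share the same basic interface: hand them the high-dimensional data, get back a low-dimensional map. A minimal sketch, assuming scikit-learn is available (the digits dataset and parameter settings are illustrative, not from the paper):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# A small "library": 300 handwritten-digit images, each described
# by 64 pixel values (64 details per book).
X, _ = load_digits(return_X_y=True)
X = X[:300]

# PCA: fast and linear -- keeps only the largest directions of variation.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: nonlinear -- great at local neighborhoods, but slower and with
# no transform() method for placing new books that arrive later.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X.shape, X_pca.shape, X_tsne.shape)  # (300, 64) (300, 2) (300, 2)
```

UMAP follows the same `fit_transform` pattern via the separate `umap-learn` package, and unlike t-SNE it can also `transform()` new points after fitting.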

2. The Two Challenges: Classification vs. Regression

The researchers tested these librarians in two different scenarios:

Scenario A: The Sorting Game (Classification)

  • The Task: "Sort these books into 10 distinct piles based on genre." (The answer is a category: Mystery, Romance, etc.)
  • The Result: UMAP won. When the librarian was told, "Hey, these books are all Mysteries, keep them close together," UMAP did an incredible job. It created a map where all the Mysteries were in one corner and all the Romances in another. It was even better than the old-school methods.
  • Analogy: If you ask UMAP to sort your friends into "Sports," "Art," and "Music" groups, it does a fantastic job of clustering them perfectly.

Scenario B: The Prediction Game (Regression)

  • The Task: "Predict the exact price of a house based on its features." (The answer is a number, like $450,200.50).
  • The Result: UMAP stumbled. This is the big surprise. When the researchers told UMAP, "Hey, these houses cost $500k and those cost $600k, keep them close in that order," UMAP got confused. It actually performed worse than when it was left to its own devices (Unsupervised).
  • The Problem: It seems UMAP tried to memorize the exact prices too hard, like a student cramming for a test by memorizing the answer key instead of learning the concept. When it tried to predict the price of a new house, it failed because it had "overfit" (memorized the training data too specifically).
  • The Winner: The "Targeted" librarian (SIR) actually did the best here. It knew how to use the price information to find the right path without getting confused.
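To see what the "Targeted" librarian actually does, here is a compact, illustrative sketch of classic SIR (the Li 1991 recipe, not the paper's code): standardize the features, slice the sorted response into bins, average the standardized features within each slice, and eigen-decompose the covariance of those slice means. The toy data below is hypothetical.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_components=1):
    """Sliced Inverse Regression: directions in X that drive y."""
    n, p = X.shape
    # Whiten X using the inverse square root of its covariance.
    mean = X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Z = (X - mean) @ inv_sqrt

    # Slice the data by sorted response; average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original scale.
    _, vecs = np.linalg.eigh(M)
    return inv_sqrt @ vecs[:, ::-1][:, :n_components]

# Toy regression: y depends on X only through the direction (1, 1, 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + X[:, 1]) ** 3 + 0.1 * rng.normal(size=2000)
d = sir_directions(X, y).ravel()
d /= np.linalg.norm(d)
print(d)  # roughly proportional to (1, 1, 0)
```

Because SIR only uses the response to decide *which direction to look in*, it cannot "memorize" individual price values the way the supervised UMAP objective apparently did.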

3. The Real-World Test

The researchers didn't just use made-up data; they tested on real things:

  • Fashion Photos: They tried to sort photos of clothes (T-shirts vs. boots). UMAP crushed it. It separated the clothes perfectly.
  • News Popularity: They tried to predict how many times a news article would be shared (a number). SIR won again. UMAP struggled to use the "number of shares" to help organize the articles effectively.

The Big Takeaway

Think of UMAP as a brilliant artist who is amazing at drawing a map of a city where all the coffee shops are close together and all the parks are in a big green zone. If you ask it to sort things into categories (Coffee vs. Parks), it's the best in the world.

However, if you ask it to draw a map where the distance between points represents a specific number (like "this house is exactly $50,000 more expensive than that one"), it gets a bit lost. It tries to force the numbers into its artistic style and ends up making mistakes.

The Conclusion:

  • For Sorting/Categorizing (Classification): Use Supervised UMAP. It's a powerhouse.
  • For Predicting Numbers (Regression): Be careful with Supervised UMAP. It currently struggles to use the "answer" to help organize the data. The older, more specialized methods (like SIR) are currently more reliable for this job.

The authors are essentially saying: "UMAP is a fantastic tool, but we need to teach it a new trick before it can master the art of predicting numbers."
