A Comparative Study of UMAP and Other Dimensionality Reduction Methods

This paper presents a comprehensive comparative analysis of UMAP against other dimensionality reduction methods, revealing that while supervised UMAP excels in classification tasks, it currently struggles to effectively incorporate response information for regression applications.

Guanzhe Zhang, Shanshan Ding, Zhezhen Jin

Published 2026-03-04

Imagine you have a massive library filled with millions of books. Each book is described by thousands of details: the author's name, the font size, the color of the cover, the number of pages, the smell of the ink, the temperature of the room it was written in, and so on. This is what data scientists call high-dimensional data. It's too much to hold in your head at once, and it's hard to find patterns.

Dimensionality Reduction is like hiring a super-smart librarian who can summarize all those thousands of details into just a few key tags (like "Mystery," "Romance," or "Sci-Fi") so you can easily organize and find what you need.

This paper is a "taste test" comparing different librarians (algorithms) to see who does the best job. The main star of the show is a new, trendy librarian called UMAP (Uniform Manifold Approximation and Projection). The authors wanted to see if UMAP is as good at organizing books when you give it a specific goal (like "find me all the mystery novels") compared to when it just organizes them on its own.

Here is the breakdown of the study using simple analogies:

1. The Contestants (The Methods)

The paper pits UMAP against some old-school and some other new librarians:

  • PCA (Principal Component Analysis): The "Old School" librarian. It's fast and simple, but it only looks for the biggest, most obvious differences. It's like saying, "All books with red covers go here." It misses subtle, complex patterns.
  • t-SNE: The "Artistic" librarian. It's amazing at grouping books that look very similar (local patterns) but sometimes loses the big picture of how the whole library is arranged (global structure). It's also slow and can't easily organize new books that arrive later.
  • SIR (Sliced Inverse Regression): The "Targeted" librarian. It's a specialist that only works if you tell it what you are looking for (the "response"). It's great at finding the specific path to the answer.
  • UMAP (The Star): The "Modern" librarian. It's famous for being fast and great at keeping both the small details and the big picture intact.
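In code, these librarians share the same basic interface: hand them the high-dimensional data, get back a low-dimensional map. A minimal sketch, assuming scikit-learn is available (the digits dataset and parameter settings are illustrative, not from the paper):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# A small "library": 300 handwritten-digit images, each described
# by 64 pixel values (64 details per book).
X, _ = load_digits(return_X_y=True)
X = X[:300]

# PCA: fast and linear -- keeps only the largest directions of variation.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: nonlinear -- great at local neighborhoods, but slower and with
# no transform() method for placing new books that arrive later.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X.shape, X_pca.shape, X_tsne.shape)  # (300, 64) (300, 2) (300, 2)
```

UMAP follows the same `fit_transform` pattern via the separate `umap-learn` package, and unlike t-SNE it can also `transform()` new points after fitting.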

2. The Two Challenges: Classification vs. Regression

The researchers tested these librarians in two different scenarios:

Scenario A: The Sorting Game (Classification)

  • The Task: "Sort these books into 10 distinct piles based on genre." (The answer is a category: Mystery, Romance, etc.)
  • The Result: UMAP won. When the librarian was told, "Hey, these books are all Mysteries, keep them close together," UMAP did an incredible job. It created a map where all the Mysteries were in one corner and all the Romances in another. It was even better than the old-school methods.
  • Analogy: If you ask UMAP to sort your friends into "Sports," "Art," and "Music" groups, it does a fantastic job of clustering them perfectly.

Scenario B: The Prediction Game (Regression)

  • The Task: "Predict the exact price of a house based on its features." (The answer is a number, like $450,200.50).
  • The Result: UMAP stumbled. This is the big surprise. When the researchers told UMAP, "Hey, these houses cost $500k and those cost $600k, keep them close in that order," UMAP got confused. It actually performed worse than when it was left to its own devices (Unsupervised).
  • The Problem: It seems UMAP tried to memorize the exact prices too hard, like a student cramming for a test by memorizing the answer key instead of learning the concept. When it tried to predict the price of a new house, it failed because it had "overfit" (memorized the training data too specifically).
  • The Winner: The "Targeted" librarian (SIR) actually did the best here. It knew how to use the price information to find the right path without getting confused.
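To see what the "Targeted" librarian actually does, here is a compact, illustrative sketch of classic SIR (the Li 1991 recipe, not the paper's code): standardize the features, slice the sorted response into bins, average the standardized features within each slice, and eigen-decompose the covariance of those slice means. The toy data below is hypothetical.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_components=1):
    """Sliced Inverse Regression: directions in X that drive y."""
    n, p = X.shape
    # Whiten X using the inverse square root of its covariance.
    mean = X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Z = (X - mean) @ inv_sqrt

    # Slice the data by sorted response; average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original scale.
    _, vecs = np.linalg.eigh(M)
    return inv_sqrt @ vecs[:, ::-1][:, :n_components]

# Toy regression: y depends on X only through the direction (1, 1, 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + X[:, 1]) ** 3 + 0.1 * rng.normal(size=2000)
d = sir_directions(X, y).ravel()
d /= np.linalg.norm(d)
print(d)  # roughly proportional to (1, 1, 0)
```

Because SIR only uses the response to decide *which direction to look in*, it cannot "memorize" individual price values the way the supervised UMAP objective apparently did.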

3. The Real-World Test

The researchers didn't just use made-up data; they tested on real things:

  • Fashion Photos: They tried to sort photos of clothes (T-shirts vs. boots). UMAP crushed it. It separated the clothes perfectly.
  • News Popularity: They tried to predict how many times a news article would be shared (a number). SIR won again. UMAP struggled to use the "number of shares" to help organize the articles effectively.

The Big Takeaway

Think of UMAP as a brilliant artist who is amazing at drawing a map of a city where all the coffee shops are close together and all the parks are in a big green zone. If you ask it to sort things into categories (Coffee vs. Parks), it's the best in the world.

However, if you ask it to draw a map where the distance between points represents a specific number (like "this house is exactly $50,000 more expensive than that one"), it gets a bit lost. It tries to force the numbers into its artistic style and ends up making mistakes.

The Conclusion:

  • For Sorting/Categorizing (Classification): Use Supervised UMAP. It's a powerhouse.
  • For Predicting Numbers (Regression): Be careful with Supervised UMAP. It currently struggles to use the "answer" to help organize the data. The older, more specialized methods (like SIR) are currently more reliable for this job.

The authors are essentially saying: "UMAP is a fantastic tool, but we need to teach it a new trick before it can master the art of predicting numbers."
