The Theory behind UMAP?

Here is an explanation of the paper "The Theory behind UMAP?" by David Wegmann, translated into simple language with creative analogies.

The Big Picture: Fixing the Blueprint

Imagine a popular new machine called UMAP (Uniform Manifold Approximation and Projection). Data scientists love this machine because it takes a giant, messy cloud of data points (like thousands of photos or customer records) and squishes them down into a simple 2D map that humans can understand, while trying to keep the "shape" of the data intact.

When the creators of UMAP (McInnes et al.) published their paper in 2018, they also included a "theoretical blueprint" to explain why the machine works. They claimed the machine was built on a sophisticated mathematical structure called Spivak's Metric Realization.

The Problem: The author of this paper, David Wegmann, looked at that blueprint and found it was full of holes, cracks, and confusing instructions. The original blueprint was written by a mathematician named Spivak in an unpublished draft, and the UMAP creators copied it, including the mistakes.

The Mission: Wegmann's paper is like a master architect coming in to repair the blueprint. He fixes the math errors, clarifies the confusing parts, and rebuilds the theory so it actually makes sense. He wants to prove exactly how the machine works and where the original theory went wrong.

The Core Concepts (With Analogies)

1. The "Fuzzy" Membership Card

To understand UMAP, you first need to understand Fuzzy Sets.

Normal Set: Imagine a club. You are either a member (1) or you aren't (0).
Fuzzy Set: Imagine a VIP club with different levels of access. You might be a "Gold Member" (strength 1), a "Silver Member" (strength 0.5), or just "hanging around the door" (strength 0.1).
The Paper's Fix: The original UMAP theory tried to define these membership levels using a specific mathematical rule (a "topology") that was missing a crucial piece (the "empty set"). Wegmann fixes this definition so the math holds up.
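
The membership idea can be sketched in a few lines of Python. This is only an illustration of the analogy, not code from the paper: the names `crisp_club`, `fuzzy_club`, `membership`, and `is_valid_fuzzy_set` are made up for the example, and the strengths are the ones used in the club analogy.

```python
# Hypothetical names and values, chosen only to mirror the club analogy.

# A normal (crisp) set: an element is either in or out.
crisp_club = {"alice", "bob"}

# A fuzzy set: every listed element carries a membership strength in (0, 1].
fuzzy_club = {"alice": 1.0, "bob": 0.5, "carol": 0.1}


def membership(fuzzy_set, element):
    """Return the membership strength, or 0.0 for elements not listed."""
    return fuzzy_set.get(element, 0.0)


def is_valid_fuzzy_set(fuzzy_set):
    """Check that every stored strength lies in the interval (0, 1]."""
    return all(0.0 < strength <= 1.0 for strength in fuzzy_set.values())
```

For example, `membership(fuzzy_club, "bob")` gives the "Silver Member" strength 0.5, while someone who is not listed at all gets strength 0.0.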

2. The "Shape-Shifting" Lego Blocks

The theory uses Simplicial Sets. Think of these as Lego blocks of different shapes:

0D = A point.
1D = A line.
2D = A triangle.
3D = A triangular pyramid (a tetrahedron).

In UMAP, these aren't just static shapes. They are Metric Simplices. This means every Lego block has a "size" or "scale" attached to it.

The Original Mistake: The original theory tried to make the Lego blocks themselves change size based on the data. Wegmann realized this was messy and caused division-by-zero errors (like trying to divide a pizza by zero slices).
The Fix: Wegmann proposes keeping the Lego blocks the same size but changing the distance between the points on the block. It's like keeping the plastic brick the same but stretching the rubber band connecting two studs. This makes the math much cleaner and avoids the errors.

3. The "Glue" (The Functor)

The core of the theory is a mathematical machine called a Functor (specifically, the "Metric Realization").

The Analogy: Imagine you have a bag of instructions (the data) and a bag of Lego bricks (the shapes). The Functor is the robot that reads the instructions and glues the Lego bricks together to build a 3D sculpture.
The Problem: The original instructions told the robot to glue things in a way that sometimes broke the laws of geometry (making distances behave strangely).
The Fix: Wegmann rewrites the robot's programming. He proves that if you use a specific type of "glue" (called the L1 metric or Manhattan distance), the robot builds a stable, non-breaking sculpture every time.

4. The "Finite" Version (The Real UMAP)

The original theory dealt with infinite possibilities, but the actual UMAP algorithm runs on computers with finite memory.

The Challenge: McInnes et al. tried to create a "Finite" version of the theory for the computer, but they left the definition of "Finite" vague. It was like saying "build a small house" without defining what "small" means.
The Fix: Wegmann defines exactly what "Finite" means in this context. He shows how to take the infinite mathematical theory and chop it down into a version that fits on a hard drive, proving that the computer algorithm is indeed a valid, finite version of the grand theory.

What Does This Mean for UMAP?

Wegmann concludes by looking at the actual UMAP algorithm steps and asking: "Does the math actually support what the algorithm does?"

The "Probability" Claim: The original paper claimed that the "weights" on the data graph (how connected two points are) act like probabilities (e.g., "There is a 90% chance these two points are neighbors").
- Wegmann's Verdict: "Not so fast." He argues that while it looks like probability, the original paper never actually proved it mathematically. It's a useful metaphor, but not a proven fact.
The "Topology" Claim: The original paper claimed UMAP preserves the "topology" (the shape) of the data.
- Wegmann's Verdict: The math he fixed does show how to build a shape that represents the data, but the claim that the algorithm perfectly preserves the shape of the original universe is still an assumption, not a proven theorem.

The Takeaway

Think of this paper as a forensic audit of a famous building.

The building (UMAP) is beautiful and popular.
The original blueprints (Spivak/McInnes theory) had structural flaws and confusing notes.
David Wegmann is the engineer who stepped in, fixed the cracks, clarified the notes, and confirmed that yes, the building is safe to stand in, but we need to be careful about what we claim it can do.

He didn't tear the building down; he just made sure the foundation is solid so that future architects (data scientists) can build on top of it without it collapsing.

Here is a detailed technical summary of the thesis "The Theory behind UMAP?" by David Wegmann.

1. Problem Statement

The paper addresses the theoretical foundations of the UMAP (Uniform Manifold Approximation and Projection) algorithm, a popular dimensionality reduction technique introduced by McInnes et al. in 2018. While UMAP is widely used, its theoretical justification relies on a "metric realization" functor derived from an unpublished draft by David Spivak (2018) and earlier work by Barr (1986) on fuzzy sets.

The author identifies that both Spivak's draft and McInnes et al.'s implementation contain significant mathematical errors, gaps, and inconsistencies, including:

Incorrect Definitions: Misdefinitions of fuzzy sets (confusing presheaves and sheaves) and of the topology on the interval $I=(0,1]$.
Mathematical Flaws: Use of logarithms where parameters can be 0 or 1, leading to undefined values ($\log(0)$) or division by zero.
Metric Issues: The use of the Euclidean ($\ell_2$) metric in the original draft fails to ensure that degeneracy maps are non-expansive (a requirement for the functor to be well-defined).
Categorical Gaps: Failure to prove that specific functors are well-defined on the required categories (e.g., verifying that images of Yoneda embeddings are sheaves).
Ambiguity: Vague definitions of "finite" and "bounded" variants of the categories used in the finite metric realization.
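
The logarithm problem can be made concrete with a small Python sketch. It assumes, following the published UMAP paper's version of the construction, that a membership strength $a$ is turned into a simplex scale $-\log(a)$ and that maps between simplices rescale by $\log(b)/\log(a)$. These formulas are not spelled out in this summary, so treat them as an illustrative assumption; the function names are hypothetical.

```python
import math


def simplex_scale(a):
    """Scale -log(a) assigned to membership strength a (assumed formula)."""
    return -math.log(a)


def rescale_factor(a, b):
    """Factor log(b) / log(a) for a map between scaled simplices (assumed)."""
    return math.log(b) / math.log(a)


def try_call(func, *args):
    """Run func and return the exception name instead of crashing."""
    try:
        return func(*args)
    except (ValueError, ZeroDivisionError) as error:
        return type(error).__name__


# Strength 0 has no logarithm, and strength 1 puts log(1) = 0 in a denominator.
at_zero = try_call(simplex_scale, 0.0)       # math.log(0.0) is undefined
at_one = try_call(rescale_factor, 1.0, 0.5)  # division by log(1.0) == 0.0
```

Under these assumptions both boundary values of the membership strength break the formulas, which is the kind of singularity the bullet on mathematical flaws refers to.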

The thesis aims to repair these errors, provide a self-contained and rigorous derivation of the metric realization, and clarify the correspondence between the mathematical theory and the UMAP algorithm.

2. Methodology

The author employs Category Theory and Sheaf Theory to reconstruct the theory rigorously. The methodology involves:

Categorical Reconstruction: Re-defining the necessary categories (Fuzzy Sets, Normed Sets, Simplicial Objects) using precise sheaf-theoretic definitions rather than the flawed presheaf definitions found in the literature.
Correction of Functors:
- Metric Choice: Replacing the Euclidean metric with the $\ell_1$ (Manhattan) metric for simplices. The author proves that only the $\ell_1$ metric ensures that degeneracy maps are non-expansive, a critical property for the functor to exist.
- Handling Parameters: Resolving the $\log(0)$ and division-by-zero issues by re-parameterizing the metric simplices. Instead of scaling the underlying set of the simplex (which causes issues at 0 and 1), the author scales the metric itself while keeping the underlying set fixed.
Equivalence of Categories: Explicitly constructing the equivalence between Classical Valued Sets (sets with membership functions) and Sheaf-Theoretic Valued Sets (sheaves on a locale). This allows the author to switch between the intuitive "classical" view and the rigorous "sheaf" view.
Kan Extensions: Defining the Metric Realization and Finite Metric Realization as Left Kan Extensions along the Yoneda embedding. This provides a universal construction that guarantees the existence of the functors under specific cocompleteness conditions.
Finite Variants: Rigorously defining "finite" and "bounded" subcategories to reproduce McInnes et al.'s finite metric realization, proving that the necessary colimits exist even when the target category (finite extended pseudo-metric spaces) is not fully cocomplete.
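
The metric choice can be checked numerically. On standard simplices, the map induced by a degeneracy adds two adjacent barycentric coordinates. The sketch below (hypothetical function names, one hand-picked pair of points) shows that this map increases the Euclidean distance for that pair, while the $\ell_1$ distance does not grow. It is a single counterexample for $\ell_2$, not a reproduction of the thesis's proof.

```python
def l1_distance(x, y):
    """Manhattan distance between two coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def merge_coordinates(point, i):
    """Add coordinates i and i + 1, sending an n-simplex to an (n-1)-simplex."""
    return point[:i] + (point[i] + point[i + 1],) + point[i + 2:]


# Two points of the standard 2-simplex (coordinates are >= 0 and sum to 1).
x = (0.5, 0.5, 0.0)
y = (0.0, 0.0, 1.0)

# Their images in the standard 1-simplex after merging coordinates 0 and 1.
sx = merge_coordinates(x, 0)
sy = merge_coordinates(y, 0)

l1_before, l1_after = l1_distance(x, y), l1_distance(sx, sy)
l2_before, l2_after = l2_distance(x, y), l2_distance(sx, sy)
```

Here `l2_after` exceeds `l2_before`, so the map is expansive in the Euclidean metric. For $\ell_1$ the distance cannot grow in general, because $|u + v| \le |u| + |v|$ holds for the two merged coordinate differences.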

3. Key Contributions

A. Rigorous Definition of Metric Realization

The author provides the first explicit, error-free construction of Spivak's metric realization functor:
$\text{MetRe}: \text{USNSet} \to \text{EPMet}$
where $\text{USNSet}$ is the category of uncurried simplicial normed sets and $\text{EPMet}$ is the category of extended pseudo-metric spaces.

Correction of Metric: Proved that the $\ell_1$ metric is necessary for non-expansiveness of degeneracy maps.
Correction of Scaling: Defined metric simplices $\Delta_{n,a}$ with a fixed underlying set but a scaled metric $a \cdot \|\cdot\|_1$, avoiding the singularities present in Spivak's definition.
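
As a minimal sketch of the scaling correction, the distance on $\Delta_{n,a}$ can be written as $a \cdot \|x - y\|_1$ on the unchanged standard simplex. The function name and the sample values of $a$ below are assumptions for illustration; which values of $a$ the thesis admits is not stated in this summary.

```python
def scaled_l1_distance(x, y, a):
    """Distance a * ||x - y||_1 between two points of the standard simplex."""
    return a * sum(abs(p - q) for p, q in zip(x, y))


# The two corners of the standard 1-simplex; the point set never changes.
corner_0 = (1.0, 0.0)
corner_1 = (0.0, 1.0)

# Only the metric is rescaled. No logarithm or division is involved, so
# every finite non-negative scale gives a well-defined distance.
distances = {a: scaled_l1_distance(corner_0, corner_1, a) for a in (0.0, 0.5, 3.0)}
```

With a scale of 0 all distances collapse to zero while the points stay distinct, which is one reason a pseudo-metric (rather than a metric) is the natural setting.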

B. Explicit Classical Formulation

While the sheaf-theoretic construction is rigorous, it is abstract. The author derives the Classical Metric Realization ($\text{CMetRe}$), which maps a simplicial classical normed set directly to a quotient of metric simplices:
$\text{CMetRe}(S) = \left( \coprod_{s \in S} \Delta_{s} \right) / \sim_C$
Here, the metric scale of the simplex corresponding to an element $s$ is determined by its norm $\|s\|$. This formulation is much more intuitive and computationally transparent than the sheaf-theoretic version.

C. Resolution of Finite Metric Realization

The author formalizes McInnes et al.'s "finite" variant, which is the theoretical basis for the UMAP algorithm:

Defined Fin-EPMet (finite extended pseudo-metric spaces) and Fin-USFuz (finite uncurried simplicial fuzzy sets).
Proved that the Left Kan extension exists in this restricted setting by showing that the colimits (quotients) of finite diagrams remain finite.
Derived the explicit formula for the finite realization, showing it corresponds to a union of finite metric simplices.

D. Critique of UMAP Theoretical Claims

In the final section, the author critically evaluates the claims made in the original UMAP paper:

Fuzzy Sets as Graphs: Confirmed that the local graphs constructed in UMAP correspond to the 1-skeleton of the finite singular nerve.
Probabilistic Interpretation: Argued that the claim "edge weights represent probabilities" lacks formal justification in the original paper. The author notes that while T-conorms (like the probabilistic sum) are used for graph unions, there is no rigorous probability space defined to support the interpretation of weights as probabilities.
Topological Preservation: Highlighted that the claim that UMAP preserves the topology of the underlying Riemannian manifold is currently an intuition without a formal theorem.
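
The probabilistic sum mentioned above is the T-conorm $a + b - ab$. A minimal sketch of using it to merge directed edge weights into undirected ones follows; the function names and the sample weights are hypothetical, and the graph is stored as a plain dictionary for simplicity.

```python
def probabilistic_sum(a, b):
    """Probabilistic T-conorm: a + b - a * b."""
    return a + b - a * b


def symmetrize(directed_weights):
    """Merge directed weights {(i, j): w} into undirected weights."""
    undirected = {}
    for (i, j), weight in directed_weights.items():
        key = (min(i, j), max(i, j))
        # A missing reverse edge counts as membership strength 0.
        reverse = directed_weights.get((j, i), 0.0)
        undirected[key] = probabilistic_sum(weight, reverse)
    return undirected


# Point 0 rates point 1 highly, point 1 rates point 0 less so,
# and the edge from 1 to 2 has no reverse edge.
directed = {(0, 1): 0.9, (1, 0): 0.5, (1, 2): 0.4}
merged = symmetrize(directed)
```

The formula matches the probability that at least one of two independent events occurs, which is where the probabilistic reading comes from. The critique above is that no probability space is ever defined to make that reading rigorous.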

4. Results

Corrected Theory: The thesis provides a mathematically sound framework for the metric realization, fixing all identified errors in Spivak's draft and McInnes et al.'s paper.
Functorial Correspondence: Made the two realization functors explicit:
$\text{Simplicial Normed Sets} \xrightarrow{\text{Classical Realization}} \text{Extended Pseudo-Metric Spaces}$
$\text{Simplicial Fuzzy Sets} \xrightarrow{\text{Finite Realization}} \text{Finite Extended Pseudo-Metric Spaces}$
Algorithmic Insight: The analysis reveals that the UMAP algorithm effectively constructs a finite extended pseudo-metric space from the input data via a finite metric realization, followed by an optimization step (stochastic gradient descent) to embed this space into a lower dimension.
Limitations Identified: The thesis concludes that while the construction of the intermediate graph and its metric space is mathematically sound (once corrected), the justification for why this specific construction preserves the topology of the manifold (and why specific choices like the probabilistic T-conorm are optimal) remains unproven.
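
The graph-construction stage described under Algorithmic Insight can be sketched for a single data point. The weight formula $\exp(-\max(0, d - \rho)/\sigma)$ and the normalization target $\log_2(k)$ are taken from the UMAP paper rather than from this summary, so treat them as assumptions; the function names, the bisection bounds, and the sample distances are hypothetical. The later embedding step (stochastic gradient descent) is not covered.

```python
import math


def local_weights(neighbor_distances, sigma):
    """Membership strengths exp(-max(0, d - rho) / sigma) for one point."""
    rho = min(neighbor_distances)  # distance to the nearest neighbor
    return [math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_distances]


def fit_sigma(neighbor_distances, iterations=64):
    """Bisect for sigma so the weights sum to log2(k), k = neighbor count."""
    target = math.log2(len(neighbor_distances))
    low, high = 1e-6, 1e3  # assumed search bounds for this sketch
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if sum(local_weights(neighbor_distances, mid)) > target:
            high = mid  # weights too large: shrink sigma
        else:
            low = mid   # weights too small: grow sigma
    return (low + high) / 2.0


# Distances from one point to its k = 4 nearest neighbors (made-up values).
distances = [1.0, 1.5, 2.0, 3.0]
sigma = fit_sigma(distances)
weights = local_weights(distances, sigma)
```

The nearest neighbor always receives strength 1, and the remaining strengths decay with distance. These per-point strengths are the directed edge weights that are later merged into one graph.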

5. Significance

Foundational Repair: This work is essential for the theoretical community working on topological data analysis (TDA). It removes the "black box" nature of UMAP's theoretical justification by providing a rigorous, error-free mathematical derivation.
Clarification for Practitioners: By providing the explicit classical formulas, the paper helps practitioners understand exactly what geometric operations are being performed on the data (e.g., how membership strengths translate to distances).
Future Research Direction: The thesis identifies specific gaps in the UMAP theory (e.g., the probabilistic interpretation of weights and the formal proof of topological preservation). It sets a clear agenda for future research to either prove these claims or refine the algorithm based on rigorous topological constraints.
Categorical Unification: It successfully unifies concepts from fuzzy set theory, sheaf theory, and metric geometry, demonstrating how category theory can be used to resolve inconsistencies in applied machine learning algorithms.

In summary, David Wegmann's thesis serves as a critical "patch" for the theoretical underpinnings of UMAP, transforming a collection of heuristic arguments and flawed definitions into a coherent, rigorous mathematical framework, while simultaneously offering a sober critique of the algorithm's unproven claims regarding topological preservation.