Imagine you are trying to teach a computer to understand recipes.
In the world of data, a "recipe" is often represented as a list of ingredients that must add up to 100%. If you have a cake, it might be 40% flour, 30% sugar, and 30% butter. In math, this is called a Simplex. It's a special shape where all the numbers are non-negative and must sum to one.
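In code, a simplex point is just an array obeying those two rules. A minimal sketch (the cake numbers are illustrative, not from the paper):

```python
import numpy as np

# A "recipe" is a point on the probability simplex:
# non-negative entries that sum to one.
cake = np.array([0.40, 0.30, 0.30])  # flour, sugar, butter

# The two simplex constraints:
assert np.all(cake >= 0)
assert np.isclose(cake.sum(), 1.0)
```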
The problem is that computers are terrible at learning on this specific "recipe shape." They are used to working in Euclidean space—which is like a giant, flat, infinite grid (think of a standard graph paper or a video game world where you can walk in any direction forever).
This paper proposes a clever trick to help computers learn recipes without getting confused by the rules of the recipe shape.
The Core Idea: The "Magic Slide"
Think of the Simplex (the recipe shape) as a curved, slippery slide that ends at a wall.
- The Wall: The edges of the slide represent "pure" ingredients (100% flour, 0% sugar). This is where real, discrete data lives (like a DNA letter being strictly 'A' or 'G').
- The Slide: The middle of the slide is smooth and curved.
Previous methods tried to teach the computer to walk on this slippery slide. This is hard because the slide has weird geometry (it's not flat), and walking right up to the wall (the edge) is dangerous and mathematically messy.
This paper's solution is to build a "Magic Slide" (a bijection) that connects the curved recipe slide to a flat, easy-to-walk-on floor (Euclidean space).
- The Transformation (The Magic Slide): The authors use a mathematical tool called the Aitchison Geometry (specifically the Isometric Logratio or Stick-breaking transforms). Imagine this as a special pair of glasses or a lens. When you look at the recipe data through this lens, the curved, slippery slide suddenly looks like a flat, normal floor.
- The Training (Learning on the Floor): Now, the computer can use its standard, powerful tools (like Flow Matching) to learn how to generate new recipes. It's much easier to learn to draw a picture on a flat piece of paper than on a curved, wobbly balloon.
- The Dequantization (The "Fuzzy" Recipe): There's a catch. Real recipes are discrete (you can't have 0.0001% of an egg; it's either an egg or not). But the "Magic Slide" only works on the smooth middle of the slide, not the hard edges: the logratio transforms take logarithms, and the logarithm of a 0% ingredient is undefined.
- The Fix: The authors use a technique called Dirichlet Interpolation. Imagine taking a pure "100% Flour" point and gently shaking it with a little bit of "noise" so it becomes a "99% Flour, 1% other" point. This moves the data from the hard edge onto the smooth slide where the computer can learn it.
- The Recovery (Taking the Glasses Off): Once the computer generates a new "recipe" on the flat floor, they use the Magic Slide in reverse to turn it back into the recipe shape. Finally, they look at the result and say, "Okay, this is 99% flour, so let's just call it 100% Flour." This is the Arg Max operation (picking the biggest number).
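The dequantize-then-snap-back loop above can be sketched in a few lines. `dequantize` and `recover` are hypothetical helper names, and the simple noise-mixing scheme here stands in for the paper's exact Dirichlet interpolation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dequantize(one_hot, alpha=0.01):
    # Pull a vertex ("100% flour") slightly into the smooth interior
    # by mixing in a little Dirichlet noise. alpha controls how far
    # from the "wall" the point ends up.
    noise = rng.dirichlet(np.ones_like(one_hot, dtype=float))
    return (1.0 - alpha) * one_hot + alpha * noise

def recover(point):
    # The Arg Max step: snap a smooth interior point back to the
    # nearest vertex (the biggest ingredient wins).
    out = np.zeros_like(point)
    out[np.argmax(point)] = 1.0
    return out

flour = np.array([1.0, 0.0, 0.0])   # a "pure" discrete point
fuzzy = dequantize(flour)           # now strictly inside the simplex
assert np.isclose(fuzzy.sum(), 1.0) # still a valid recipe
assert np.array_equal(recover(fuzzy), flour)  # round trip recovers it
```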
The Two Types of "Magic Slides"
The paper tests two specific ways to build this bridge:
- The Stick-Breaking Transform (SB): Imagine you have a stick of length 1. You break off a piece for the first ingredient, then break a piece of the remaining stick for the second, and so on. This is a very intuitive, step-by-step way to turn a recipe into a flat list of numbers.
- The Isometric Logratio Transform (ILR): This is a more symmetrical, "fair" way of looking at the data. It treats all ingredients equally, ensuring that the order you list them in doesn't change the math. It's like rotating the recipe so it looks the same from every angle.
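Both transforms can be sketched directly. These are the standard textbook constructions (a logit-based stick-breaking map and a Helmert-style ILR basis), which may differ in details from the paper's exact parameterization:

```python
import numpy as np

def stick_breaking(p):
    # Break a unit stick: each output is the logit of the fraction
    # the next ingredient takes from what remains of the stick.
    remaining, z = 1.0, []
    for x in p[:-1]:
        frac = x / remaining
        z.append(np.log(frac / (1.0 - frac)))
        remaining -= x
    return np.array(z)  # K ingredients -> K-1 flat coordinates

def ilr(p):
    # Isometric logratio: centered log-ratios projected onto an
    # orthonormal (Helmert-style) basis; order-agnostic up to rotation.
    K = len(p)
    clr = np.log(p) - np.log(p).mean()
    V = np.zeros((K - 1, K))
    for i in range(K - 1):
        V[i, : i + 1] = 1.0 / (i + 1)
        V[i, i + 1] = -1.0
        V[i] /= np.linalg.norm(V[i])
    return V @ clr  # also K-1 flat coordinates

p = np.array([0.4, 0.3, 0.3])
print(stick_breaking(p), ilr(p))
```

Note that the uniform recipe (all ingredients equal) maps to the origin under ILR, which is one way to see its "fairness": no ingredient is privileged.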
Why is this better?
- Simplicity: Instead of building complex, custom math tools to walk on the curved slide, they just use standard tools on a flat floor.
- Accuracy: Because they respect the geometry of the recipe (using Aitchison geometry), the computer doesn't get confused about how "far apart" two recipes are.
- Versatility: It works great for things like:
- DNA Sequences: Deciding if a gene is A, C, T, or G.
- Text: Predicting the next letter in a word.
- Images: Generating black-and-white pixels (which are just 0s and 1s).
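In each of these cases, the discrete data starts life as a vertex of the simplex. A one-hot encoding of a DNA letter, for instance, is exactly a "pure" recipe:

```python
import numpy as np

# Each DNA letter is a one-hot "recipe" over {A, C, G, T}:
# 100% of one letter, 0% of the rest.
ALPHABET = "ACGT"

def one_hot(letter):
    v = np.zeros(len(ALPHABET))
    v[ALPHABET.index(letter)] = 1.0
    return v

assert np.array_equal(one_hot("G"), np.array([0.0, 0.0, 1.0, 0.0]))
```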
The Analogy in a Nutshell
Imagine you are trying to teach a robot to navigate a circular track (the Simplex) that has a finish line at the edge.
- Old Way: You try to teach the robot to drive on the curved track, dealing with the weird physics of the curve and the danger of falling off the edge.
- This Paper's Way: You project the track onto a flat parking lot (Euclidean space). You teach the robot to drive on the flat lot (where it's easy). When it's done, you project the path back onto the track. If the robot ends up near the edge, you just snap it to the finish line.
The result? The robot learns faster, makes fewer mistakes, and can handle complex tasks like generating DNA or writing text, all while using the same simple tools it uses for regular, flat data.