Torus embeddings

This paper proposes adapting deep learning frameworks to utilize torus embeddings, which leverage native integer overflow for efficient quantization and TinyML deployment while achieving performance and stability comparable to standard hyperspherical embeddings.

Dan Stowell

Published 2026-03-04

Imagine you are trying to organize a massive library of information. In the world of Artificial Intelligence (AI), this "library" is made of embeddings—mathematical maps that turn complex data (like images of cats or songs of birds) into lists of numbers so computers can understand them.

For a long time, AI researchers have been organizing these lists of numbers in two main ways:

  1. The Infinite Room (Euclidean Space): You can put the numbers anywhere, but they can get lost or drift too far apart.
  2. The Giant Bubble (Hypersphere): You force all the numbers to sit on the surface of a giant, invisible ball. This keeps them organized and close together, which is great for finding similar items.

The Problem with the Bubble
While the "Giant Bubble" works well for training AI, it's a nightmare for the actual computers that run these models in the real world (like your phone, a smart thermostat, or a tiny sensor).

Why? Because the numbers on a bubble are messy decimals (like 3.14159...). Most everyday computers, especially the tiny, low-power ones, are built to handle simple whole numbers (integers) very efficiently. When you try to squeeze the messy "bubble" numbers into simple whole numbers, you lose a lot of detail, like trying to fit a high-definition photo into a pixelated 8-bit video game. It's inefficient and wastes the computer's potential.
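To make the "lost detail" concrete, here is a toy illustration (my own, not code from the paper): squeezing one coordinate of a point on the unit circle into a signed 8-bit integer and back, and measuring what the rounding throws away.

```python
def quantize_int8(x: float) -> int:
    """Map a float in [-1, 1] to a signed 8-bit integer."""
    return max(-128, min(127, round(x * 127)))

def dequantize_int8(q: int) -> float:
    """Map the 8-bit integer back to an approximate float."""
    return q / 127

x = 0.70710678  # one coordinate of a point on the unit circle
q = quantize_int8(x)
x_hat = dequantize_int8(q)
print(q)                       # 90
print(abs(x - x_hat) < 0.002)  # True: small, but real, lost detail
```

Multiply that small error across hundreds of coordinates and millions of comparisons, and the "pixelation" adds up.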

The Solution: The Donut (Torus)
Dan Stowell, the author of this paper, suggests a new way to organize the library: The Donut (or Torus).

Think of a video game like Pac-Man or Asteroids. If you walk off the right edge of the screen, you instantly reappear on the left. If you walk off the top, you reappear at the bottom. The world wraps around itself.

In math, this is called a Torus.

  • The Analogy: Instead of a sphere where you have to deal with tricky curves, imagine a flat square grid where the edges are glued together.
  • Why it's great for computers: Computers are already built to handle "wrapping around." If you add two numbers and the result is too big, the computer just "wraps around" to the start (like a clock going from 12 back to 1). This is called overflow arithmetic. It's the fastest, most basic thing a computer can do.

By designing the AI's map to be a "Donut" instead of a "Bubble," the data fits perfectly into the computer's native language (simple whole numbers) without needing complex conversions.
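The wraparound idea can be sketched in a few lines (my own toy illustration, not code from the paper): on an 8-bit "circle" of 256 positions, addition wraps around automatically, and distances respect the wraparound too.

```python
M = 256  # number of positions on the circle (one byte)

def wrap_add(a: int, b: int) -> int:
    """Addition with wraparound -- what 8-bit hardware does for free."""
    return (a + b) % M

def circle_dist(a: int, b: int) -> int:
    """Shortest distance around the circle, going either way."""
    d = abs(a - b) % M
    return min(d, M - d)

print(wrap_add(200, 100))   # 44: walked off the edge, reappeared
print(circle_dist(250, 5))  # 11: near the "seam", still close neighbours
```

On real chips the `% M` is not even an extra instruction; it is just what happens when a byte overflows.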

How They Tried It
The author tested two ways to make this "Donut map":

  1. The "Clifford" Method: This was like trying to fold a piece of paper into a donut shape using complex origami. It worked, but it was numerically unstable, and training sometimes broke down entirely.
  2. The "Pairwise Normalization" Method: This was like taking pairs of numbers and gently twisting them into a circle. This method was stable, easy to train, and performed just as well as the traditional "Bubble" method.
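Here is a hedged sketch of the "pairwise normalization" idea as described above (the function name and details are my own, not the paper's code, and it assumes an even-length vector): group the embedding's numbers into pairs and scale each pair to unit length, so each pair sits on a circle and the whole vector on a torus.

```python
import math

def pairwise_normalize(v, eps=1e-8):
    """Scale each consecutive pair (x, y) to unit length."""
    out = []
    for i in range(0, len(v), 2):
        x, y = v[i], v[i + 1]
        r = math.sqrt(x * x + y * y) + eps  # eps guards against (0, 0)
        out.extend([x / r, y / r])
    return out

v = [3.0, 4.0, -1.0, 1.0]
print([round(z, 3) for z in pairwise_normalize(v)])
# → [0.6, 0.8, -0.707, 0.707]
```

Each pair of outputs satisfies x² + y² ≈ 1, i.e. each pair is a point on its own circle.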

The Results

  • Performance: The "Donut" maps worked just as well as the "Bubble" maps for recognizing cats, dogs, and bird songs.
  • Efficiency: When they compressed the data to be tiny (using very few bits, like 1 or 8 bits), the "Donut" maps held their shape better. They didn't lose as much detail as the "Bubble" maps did.
  • The Future: This is a big deal for TinyML (AI on tiny devices). If you want to put a smart AI on a battery-powered sensor in a forest to listen for birds, you don't have a supercomputer. You have a simple chip. The "Donut" method lets these simple chips run powerful AI models efficiently, because donut-shaped embeddings speak the chip's native language.
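Why do the donut maps "hold their shape" under compression? A toy sketch (my own illustration, not the paper's code): store each circle coordinate as one byte. Because both the byte and the circle wrap around, the quantization error is the same small amount everywhere on the circle, with no awkward edges or poles.

```python
import math

def angle_to_byte(theta: float) -> int:
    """Map an angle in [0, 2*pi) to one of 256 wrapped positions."""
    return round(theta / (2 * math.pi) * 256) % 256

def byte_to_angle(b: int) -> float:
    """Map the byte back to an angle."""
    return b / 256 * 2 * math.pi

theta = 6.2  # just below 2*pi, i.e. right next to the "seam"
b = angle_to_byte(theta)
err = abs(byte_to_angle(b) - theta)
err = min(err, 2 * math.pi - err)      # wraparound-aware error
print(b)                               # 253
print(err < math.pi / 256)             # True: at most half a step off
```

A sphere squeezed into integers has no such uniform guarantee; the torus does, because the integer grid and the shape wrap the same way.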

In a Nutshell
The paper argues that instead of forcing AI data into a complex, curved shape (a sphere) that doesn't fit well with simple computer chips, we should shape the data like a donut. This shape naturally fits the way computers count and wrap around, making AI faster, more efficient, and perfect for running on the small, everyday devices that surround us.
