mlx-vis: GPU-Accelerated Dimensionality Reduction and Visualization on Apple Silicon

Imagine you have a massive library containing 70,000 books, but they are all stacked in a chaotic, 784-dimensional tower that no human can climb. You want to see the patterns: which books are similar, which are outliers, and how they group together. To do this, you need to flatten that tower into a simple, 2D map (like a flat floor plan) where similar books sit next to each other. This process is called Dimensionality Reduction.

For years, doing this on a computer has been like trying to move that library using a single, slow-moving snail (the CPU), even if you have a fleet of super-fast race cars (the GPU) sitting right next to you.

Enter mlx-vis, a new tool created by Han Xiao at Jina AI. Think of it as a high-speed conveyor belt built specifically for computers with Apple Silicon chips (like the M3 Ultra).

Here is how it works, broken down into simple concepts:

1. The "All-in-One" Toolbox

Usually, if you want to organize your data, you need to download different tools for different jobs. One tool for sorting, another for mapping, another for drawing. They often speak different languages (dependencies) and run on the slow snail (CPU).

mlx-vis is like a Swiss Army Knife that does everything in one place. It handles six different ways to organize your data (UMAP, t-SNE, PaCMAP, TriMap, DREAMS, and CNE) and finds the nearest neighbors for you. The best part? It is built on a single framework: MLX, Apple's array framework for running computations on the computer's graphics chip. Its only other dependency is NumPy.

2. The Race Car vs. The Snail

On Apple Silicon computers, the brain (CPU) and the graphics chip (GPU) share the same memory. It's like having the kitchen and the dining room in one room.

Old Tools: They act like a waiter who runs back and forth between the kitchen and the dining room, carrying plates one by one. This wastes time.
mlx-vis: It acts like a chef who cooks and serves right at the table. Because its heavy lifting runs entirely on the "race car" (the GPU), it doesn't have to run back and forth, and it works on the 70,000 data points in parallel rather than one at a time.

The Result?

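Old Way: With t-SNE, a CPU-based reference implementation takes nearly a minute (58.62 seconds) to organize the data.
mlx-vis: It does the same job in 3.78 seconds, a 15.5x speedup. The gains for other methods are smaller, for example 2.6x for UMAP.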

3. The "Paintball" Renderer (Visualization)

Once the data is organized, you need to see it. Usually, computers draw these maps using a slow, step-by-step painter (like Matplotlib). If you want to make a movie showing the data organizing itself, it takes forever.

mlx-vis uses a technique called "Circle Splatting."
Imagine instead of a painter, you have a machine that shoots thousands of tiny, colored paintballs at a wall at once.

It calculates where every dot goes.
It "splats" them onto the image using the graphics chip.
It blends the colors instantly.

Because it does this on the graphics chip, it can create a smooth, 800-frame animation of the data organizing itself in about 1.4 seconds. It's like watching a time-lapse video of a flower blooming, except the whole video is ready almost as soon as you ask for it.

4. Why This Matters

Before this, if you wanted to explore complex data on an Apple computer, you were stuck waiting for the "snail" to do the work.

mlx-vis removes the wait.
It gets rid of the need for complicated, clunky software dependencies.
It turns data exploration from a "wait and see" activity into an interactive experience. You can tweak the settings, hit "run," and see the results before you can finish your coffee.

In a Nutshell

mlx-vis is a fast, Apple-optimized engine that takes messy, high-dimensional data, organizes it into a clear 2D map, and paints an animated picture of it, all within a few seconds. It turns a slow, frustrating process into a quick, interactive visual experience.

1. Problem Statement

Dimensionality reduction (DR) is essential for exploratory data analysis, yet existing solutions face two primary limitations, particularly on Apple Silicon hardware:

Fragmentation and Dependency Bloat: State-of-the-art methods (e.g., UMAP, t-SNE, PaCMAP) are distributed across independent Python packages with heterogeneous and heavy dependencies (e.g., numba, Cython, scipy, pynndescent). This complicates installation and maintenance.
Underutilization of Hardware: Most existing libraries are CPU-bound. Even on Apple Silicon, where the CPU and GPU share unified memory, these tools perform gradient updates and neighbor searches on the CPU, leaving the Metal GPU's substantial compute capacity unused. Furthermore, visualization often relies on matplotlib, which is not GPU-accelerated and creates a bottleneck in the rendering pipeline.

2. Methodology

The authors propose mlx-vis, a Python library that reimplements the entire DR and visualization pipeline using MLX, Apple's array framework designed for Metal GPUs.

Core Architecture

Pure MLX Implementation: The library implements six DR methods and a k-nearest neighbor (k-NN) graph algorithm entirely in MLX, eliminating external dependencies like scipy or numba.
Unified API: All methods follow a consistent fit_transform(X) interface.
Pipeline Stages:
1. Preprocessing: PCA and data normalization.
2. Graph Construction: Approximate k-NN search using NNDescent.
3. Optimization: Iterative gradient-based optimization to generate 2D embeddings.
4. Rendering: GPU-native visualization.
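
One way to picture the shared fit_transform(X) interface is as a small protocol that every method satisfies. The sketch below is an illustration only: the `Reducer` protocol, the `KeepFirstTwo` toy class, and the `embed_all` helper are hypothetical names, not mlx-vis's actual classes, and plain Python lists stand in for MLX arrays.

```python
from typing import Protocol, Sequence

Matrix = Sequence[Sequence[float]]


class Reducer(Protocol):
    """Hypothetical protocol: anything that exposes fit_transform(X)."""

    def fit_transform(self, X: Matrix) -> list[list[float]]: ...


class KeepFirstTwo:
    """Toy stand-in for a real DR method: centre the data, keep two columns."""

    def fit_transform(self, X: Matrix) -> list[list[float]]:
        n = len(X)
        means = [sum(row[j] for row in X) / n for j in range(2)]
        return [[row[j] - means[j] for j in range(2)] for row in X]


def embed_all(reducers: dict[str, Reducer], X: Matrix) -> dict[str, list[list[float]]]:
    """With one shared interface, comparing methods is a plain loop."""
    return {name: reducer.fit_transform(X) for name, reducer in reducers.items()}


X = [[1.0, 2.0, 9.0], [3.0, 4.0, 9.0], [5.0, 6.0, 9.0]]
embeddings = embed_all({"toy": KeepFirstTwo()}, X)
```

The point of the design is that calling code never needs to know which method it is running; only the object passed in changes.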

Algorithmic Implementations

The library covers six DR methods, grouped here into three families, plus a k-NN graph builder:

Neighbor Embedding: UMAP, t-SNE.
Pair- and triplet-based: PaCMAP (point pairs), TriMap (triplets).
Hybrid/Contrastive: DREAMS (t-SNE + PCA), CNE (Contrastive Neighbor Embedding).
NNDescent: Implemented entirely in MLX for k-NN graph construction. It uses GPU matrix multiplication for distance calculations ($\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2a^\top b$) and mx.argpartition for efficient top-k selection without full sorting.
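
The two primitives named for NNDescent, the expanded distance formula and partial top-k selection, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: it performs exact brute-force search rather than approximate NNDescent, nested lists stand in for MLX arrays, `heapq.nsmallest` stands in for mx.argpartition (unlike argpartition, it also returns the selected items in sorted order), and the function names are hypothetical.

```python
import heapq


def squared_distances(X):
    """Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b."""
    norms = [sum(v * v for v in row) for row in X]
    # Gram matrix of inner products; on a GPU this is one matrix multiply, X @ X.T.
    gram = [[sum(p * q for p, q in zip(a, b)) for b in X] for a in X]
    n = len(X)
    return [[norms[i] + norms[j] - 2.0 * gram[i][j] for j in range(n)] for i in range(n)]


def knn_indices(D, k):
    """Indices of the k nearest neighbours per row, excluding the point itself."""
    result = []
    for i, row in enumerate(D):
        candidates = [j for j in range(len(row)) if j != i]
        # Partial selection: only the k smallest are extracted, no full sort.
        result.append(heapq.nsmallest(k, candidates, key=row.__getitem__))
    return result


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 4.0]]
D = squared_distances(points)
neighbours = knn_indices(D, k=2)
# neighbours == [[1, 2], [0, 2], [0, 1], [2, 1]]
```

Expanding the norm this way turns the bulk of the work into a single matrix product, which is the operation GPUs handle best.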

Technical Optimizations

Lazy Evaluation & JIT Compilation: MLX builds a computation graph and dispatches work only upon mx.eval(). The @mx.compile decorator is used to JIT-compile hot loops (e.g., UMAP's SGD, t-SNE's repulsive forces) into fused GPU kernels, eliminating Python-level overhead.
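
The deferred-execution idea can be illustrated with a toy computation graph. This is a minimal sketch in plain Python, not MLX's internals: the `Lazy` class and the `constant` and `evaluate` helpers are hypothetical names invented for the illustration, and the kernel fusion performed by JIT compilation is not modeled.

```python
class Lazy:
    """A node in a deferred computation graph (toy stand-in for a lazy array)."""

    def __init__(self, fn, *parents):
        self.fn = fn            # how to compute this node from its parents
        self.parents = parents  # upstream nodes
        self.cached = None
        self.evaluated = False

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)


def constant(value):
    """Leaf node that simply yields a stored value when forced."""
    return Lazy(lambda: value)


def evaluate(node, log):
    """Force a node: compute its parents first, then the node itself, once."""
    if not node.evaluated:
        args = [evaluate(parent, log) for parent in node.parents]
        node.cached = node.fn(*args)
        node.evaluated = True
        log.append(node.cached)
    return node.cached


a, b, c = constant(1.0), constant(0.5), constant(4.0)
expr = a + b * c          # records a multiply and an add; nothing is computed yet
work_log = []
before = expr.evaluated   # False: the graph exists but no arithmetic has run
result = evaluate(expr, work_log)  # 3.0
```

In MLX the corresponding forcing point is mx.eval(); until it is called, operations only extend the graph.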
GPU-Native Rendering: Instead of matplotlib, the library implements a circle-splatting renderer in MLX.
- It calculates pixel offsets with linear falloff weights.
- It uses atomic scatter-add operations (mx.array.at[idx].add(vals)) to accumulate premultiplied color contributions into a framebuffer.
- It employs a double-buffering scheme to overlap GPU rendering of frame $n+1$ with I/O of frame $n$.
- Frames are piped directly to ffmpeg for hardware-accelerated H.264 encoding.
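
The accumulation step of circle splatting can be sketched as follows. This is a CPU illustration in plain Python under stated assumptions, not the library's renderer: the `splat` function, its `radius` and `alpha` parameters, and the exact falloff formula are hypothetical choices, and the final step that resolves the premultiplied buffer against a background is omitted.

```python
import math


def splat(points, colors, width, height, radius=2.0, alpha=0.5):
    """Accumulate linear-falloff discs into a premultiplied RGBA framebuffer."""
    # Flat framebuffer: one [r, g, b, a] accumulator per pixel.
    fb = [[0.0, 0.0, 0.0, 0.0] for _ in range(width * height)]
    reach = int(math.ceil(radius))
    offsets = [(dx, dy)
               for dy in range(-reach, reach + 1)
               for dx in range(-reach, reach + 1)]
    for (px, py), (r, g, b) in zip(points, colors):
        cx, cy = int(round(px)), int(round(py))
        for dx, dy in offsets:
            x, y = cx + dx, cy + dy
            if not (0 <= x < width and 0 <= y < height):
                continue
            # Linear falloff: weight 1 at the centre, 0 at the disc edge.
            w = max(0.0, 1.0 - math.hypot(dx, dy) / radius)
            if w == 0.0:
                continue
            a = alpha * w
            idx = y * width + x
            # Scatter-add: several points may target the same idx, so their
            # contributions must sum (atomically, when done on a GPU).
            pixel = fb[idx]
            pixel[0] += r * a  # colour is premultiplied by its alpha
            pixel[1] += g * a
            pixel[2] += b * a
            pixel[3] += a
    return fb


frame = splat(points=[(2.0, 2.0), (3.0, 2.0)],
              colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
              width=6, height=5)
# Pixel (2, 2) gets the red disc's centre plus the blue disc's half-weight edge.
```

Accumulating premultiplied colour makes the result independent of the order in which points are processed, which is what allows all points to be splatted in parallel.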

3. Key Contributions

First Pure MLX DR Library: The first implementation of major DR algorithms (UMAP, t-SNE, etc.) and NNDescent entirely within the MLX framework, targeting Apple Silicon.
End-to-End GPU Acceleration: Unlike previous tools, mlx-vis accelerates the entire pipeline from raw data to rendered video, including neighbor search, optimization, and visualization.
Dependency Minimization: The library depends only on MLX and NumPy, removing the need for scipy, sklearn, numba, or Cython extensions.
High-Performance Visualization: Introduces a purpose-built GPU renderer capable of producing smooth, publication-quality animations without CPU bottlenecks.

4. Experimental Results

The library was benchmarked on Fashion-MNIST (70,000 points, 784 dimensions) using an Apple M3 Ultra (512 GB unified memory).

Embedding Speed: mlx-vis significantly outperforms CPU-based reference implementations running on all available cores:
- UMAP: 2.6× speedup (3.23 s vs. 8.52 s).
- t-SNE: 15.5× speedup (3.78 s vs. 58.62 s).
- PaCMAP: 3.1× speedup.
- TriMap: 6.0× speedup.
Rendering Speed: Generating an 800-frame animation at 1000×1000 resolution takes 1.43 seconds.
Total Pipeline Time: The time from raw data to a rendered video file ranges from 3.6 to 5.2 seconds.
Quality: Visualizations are comparable to those of the reference implementations because the objective functions and optimization schedules are unmodified.
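
The reported speedups and the implied rendering rate follow directly from the timings quoted above. A quick arithmetic check, using only figures stated in this section (PaCMAP and TriMap are left out because their raw timings are not given here):

```python
# method: (mlx-vis seconds, CPU reference seconds), as quoted in the text
timings = {
    "UMAP": (3.23, 8.52),
    "t-SNE": (3.78, 58.62),
}
speedups = {name: round(cpu / gpu, 1) for name, (gpu, cpu) in timings.items()}
# speedups == {"UMAP": 2.6, "t-SNE": 15.5}

frames, render_seconds = 800, 1.43
ms_per_frame = 1000.0 * render_seconds / frames   # about 1.79 ms per frame
frames_per_second = frames / render_seconds       # about 559 frames per second
```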

5. Significance

Hardware Efficiency: By leveraging Apple Silicon's unified memory and Metal GPU, mlx-vis avoids CPU-GPU data transfers and puts GPU compute capacity to work that CPU-bound tools leave idle.
Interactive Exploration: Rendering an 800-frame animation in 1.43 seconds (under 2 ms per frame) makes it practical to explore embedding trajectories interactively, a capability that standard DR toolkits built on matplotlib do not offer.
Accessibility: By simplifying the dependency tree and providing a unified API, the library lowers the barrier to entry for using advanced DR methods on Apple devices.
Future Direction: It demonstrates the viability of using MLX for complex, iterative scientific computing tasks beyond just deep learning model training, paving the way for a more integrated ecosystem on Apple Silicon.