JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA

This paper introduces JZ-Tree, an open-source JAX and CUDA framework that employs a GPU-optimized Morton z-order plane-based tree hierarchy to overcome thread divergence and memory access inefficiencies, achieving over an order-of-magnitude performance improvement in k-nearest neighbour search and friends-of-friends clustering compared to existing GPU libraries.

Original authors: Jens Stücker, Oliver Hahn, Lukas Winkler, Adrian Gutierrez Adame, Thomas Flöss

Published 2026-04-08

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are organizing a massive party with 100 million guests (points in space). Your goal is to quickly find the 16 closest friends for every single person at the party, or to group everyone into "cliques" where everyone is standing within arm's reach of someone else.

Doing this on a standard computer (CPU) is like having one very smart but slow librarian who checks every book one by one. It works, but it takes forever.

Doing this on a modern graphics card (GPU) is like having thousands of tiny, super-fast robots working simultaneously. However, there's a catch: these robots are terrible at following complex, branching instructions. If you ask them to "go left if the guest is wearing a hat, but go right if they are holding a drink," they all get confused, stop to wait for each other, and the line moves slowly. This is called thread divergence.

This paper introduces a new way to organize the party, called JZ-Tree, specifically designed so that the thousands of robots can work in perfect unison without getting confused.

The Problem: The "Twisted" Tree

Traditional methods (like KD-trees) build a decision tree that looks like a deep, twisting maze.

  • The CPU way: The librarian walks down the maze, turning left or right based on complex rules.
  • The GPU problem: When you send 1,000 robots down this maze, some hit a "turn left" sign, and others hit a "turn right" sign. The robots who want to turn left have to wait for the "turn right" group to finish, and vice versa. They also have to run all over the building to grab different books, causing traffic jams.

The Solution: The "Z-Order" Elevator

The authors propose a new way to organize the data using something called a Morton (Z-order) sort.

The Analogy:
Imagine the party is in a giant 3D room. Instead of building a complex maze, you simply ask everyone to line up in a single, long, winding line that zigzags through the room in a repeating Z-shaped pattern.

  • People standing next to each other in the line are usually physically close to each other in the room (the line does make an occasional long jump between regions).
  • This creates a flat, predictable list rather than a deep, branching tree.
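To make the "line up along a Z" idea concrete, here is a minimal Python sketch of a Morton sort. It is an illustration, not the JZ-Tree code: the function names `interleave3`, `morton_key` and `morton_sort`, the 10 bits per axis, and the assumption that coordinates lie in [0, 1) are all choices made for this example.

```python
def interleave3(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of three integers into one Morton code.

    Bit b of ix lands at position 3*b, of iy at 3*b + 1, of iz at 3*b + 2.
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def morton_key(point, bits: int = 10) -> int:
    """Quantize a point in [0, 1)^3 to a grid cell and return that cell's Morton code."""
    scale = 1 << bits
    ix, iy, iz = (min(int(c * scale), scale - 1) for c in point)
    return interleave3(ix, iy, iz, bits)


def morton_sort(points, bits: int = 10):
    """Return the points ordered along the Z-order curve."""
    return sorted(points, key=lambda p: morton_key(p, bits))


# Four guests, one per octant of a 2 x 2 x 2 grid (bits=1).
guests = [(0.75, 0.75, 0.75), (0.25, 0.25, 0.25),
          (0.75, 0.25, 0.25), (0.25, 0.75, 0.25)]
line = morton_sort(guests, bits=1)
```

Because the bits of the three coordinates are interleaved, two points that share a long prefix of their code sit in the same small grid cell, which is why neighbours in the sorted list tend to be neighbours in the room.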

How JZ-Tree Works

Once everyone is in this Z-shaped line, the authors build a "Tree of Planes" (layers of the party).

  1. The Layers: Instead of a deep tree, they create flat layers.

    • Layer 0 (The Leaves): These are groups of up to 48 people standing next to each other in the line. Crucially, the system bundles these groups so that anyone standing in the same small "Z-order cell" stays together in the same group, even if it means the group has fewer than 48 people.
    • Layer 1: Groups of those leaves combined into bigger blocks.
    • Layer 2: Even bigger blocks.
    • The Magic: Every group in a layer sits at the same depth. It's like a stack of pancakes where every pancake is the same thickness. This makes the work predictable for the robots.
  2. The Dual Walk (The Team Huddle):
    When the robots need to find neighbors, they don't wander alone. They use a Dual Tree Walk.

    • Imagine two teams of robots looking at two different layers of the party.
    • Because the data is organized in this flat, Z-order way, the robots can grab a whole chunk of data at once (like grabbing a whole shelf of books instead of one book at a time). This is called coalesced memory access.
    • They work together in groups. If one robot finds a potential friend, the whole group checks it instantly. No one is left waiting.
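The "team huddle" can be sketched in a few lines of Python. This is a deliberately simplified, single-level version (leaf against leaf) rather than the paper's multi-layer walk, and it runs on one CPU thread. The names `make_leaves`, `box_gap2` and `knn_leaf_pairs` are hypothetical, and the leaves here are plain fixed-size chunks that ignore the "same Z-order cell stays together" rule. What it does show is the key idea: a whole group of queries is tested against a whole group of candidates, and a candidate group is skipped only when no query in the team could benefit from it.

```python
import heapq

LEAF_SIZE = 48  # leaf capacity quoted in the text


def make_leaves(ordered_points, leaf_size=LEAF_SIZE):
    """Cut an ordered point list into consecutive chunks, each with a bounding box."""
    leaves = []
    for start in range(0, len(ordered_points), leaf_size):
        pts = ordered_points[start:start + leaf_size]
        lo = tuple(min(p[d] for p in pts) for d in range(3))
        hi = tuple(max(p[d] for p in pts) for d in range(3))
        leaves.append((pts, lo, hi))
    return leaves


def box_gap2(lo_a, hi_a, lo_b, hi_b):
    """Squared minimum distance between two axis-aligned boxes (0 if they overlap)."""
    gap2 = 0.0
    for d in range(3):
        gap = max(lo_a[d] - hi_b[d], lo_b[d] - hi_a[d], 0.0)
        gap2 += gap * gap
    return gap2


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_leaf_pairs(leaves, k):
    """Map each point to its k smallest squared distances (the point itself included).

    Assumes k >= 1 and distinct points, since points are used as dictionary keys.
    """
    result = {}
    for pts_q, lo_q, hi_q in leaves:
        # Visit candidate leaves nearest-first, measured box to box.
        order = sorted(leaves, key=lambda leaf: box_gap2(lo_q, hi_q, leaf[1], leaf[2]))
        heaps = [[] for _ in pts_q]  # one max-heap per query, stored as negated distances
        for pts_c, lo_c, hi_c in order:
            gap2 = box_gap2(lo_q, hi_q, lo_c, hi_c)
            all_full = all(len(h) == k for h in heaps)
            # Group decision: stop once even the worst-off query cannot improve.
            if all_full and gap2 > max(-h[0] for h in heaps):
                break  # remaining leaves are at least this far away
            for h, q in zip(heaps, pts_q):
                for c in pts_c:
                    d2 = dist2(q, c)
                    if len(h) < k:
                        heapq.heappush(h, -d2)
                    elif d2 < -h[0]:
                        heapq.heapreplace(h, -d2)
        for h, q in zip(heaps, pts_q):
            result[q] = sorted(-v for v in h)
    return result
```

The answers are correct for any input order, but the pruning only pays off when each chunk is spatially compact, which is what sorting along the Z-order line provides. A real hierarchy would also avoid ranking every leaf for every query group, as this sketch does.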

The Results: Speeding Up the Party

The paper tested this on two big problems:

  1. K-Nearest Neighbors (KNN): Finding the closest friends.
  2. Friends-of-Friends (FoF): Grouping people into cliques.
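For readers who want to see what a friends-of-friends "clique" means in code, here is a small brute-force reference in Python. It checks every pair of points, so it is only suitable for tiny inputs and says nothing about how JZ-Tree does it on the GPU. The function names and the `linking_length` parameter (the "arm's reach" distance) are chosen for this example.

```python
def find(parent, i):
    """Return the root label of i's group, shortening the chain as it goes."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def friends_of_friends(points, linking_length):
    """Label each point with the smallest index in its friends-of-friends group.

    Two points are friends if they are at most `linking_length` apart;
    a group is everything connected through a chain of friends.
    """
    n = len(points)
    parent = list(range(n))
    ll2 = linking_length ** 2
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= ll2:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    # Attach the larger root to the smaller so labels stay minimal.
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(parent, i) for i in range(n)]


guests = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
          (5.0, 0.0, 0.0), (5.4, 0.0, 0.0)]
labels = friends_of_friends(guests, linking_length=0.6)
```

Note that the first and third guest end up in the same group even though they are a full unit apart: they are linked through the guest standing between them, which is exactly the "friend of a friend" rule.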

The Performance:

  • Old GPU methods: About 10 to 100 times slower than the new method for huge parties (10 million+ people).
  • The New Method (JZ-Tree): It scales well. If you add more GPUs (more robots), the speed increases almost in proportion. They can process 100 billion neighbor checks in just a few seconds.

Why This Matters

This isn't just about finding friends at a party. This technology is crucial for:

  • Cosmology: Simulating how galaxies form and cluster in the universe.
  • Physics: Simulating how millions of particles interact in fluids or gases.
  • AI: Making machine learning models faster by quickly finding similar data points.

The Takeaway

The authors took a problem that was hard for GPUs (complex, branching trees) and turned it into a problem GPUs love (flat, predictable, organized lists). By arranging the data in a "Z-shape" and letting the robots work in synchronized teams, they unlocked a massive speedup, making super-complex simulations possible in a fraction of the time it used to take.

They even made the code open-source (called JZ-Tree), so anyone can use this "super-organized party planner" for their own scientific discoveries.
