A Scalable Fast Multipole Method Poisson Solver for the… — Plain-Language Explanation

Imagine you are trying to calculate the gravitational pull of every star, planet, and cloud of gas in a massive simulation of the universe. To do this accurately, you have to figure out how every single piece of matter interacts with every other piece. If you have a billion pieces of matter, checking every single pair against each other is like trying to shake hands with every person on Earth individually: it takes far too long to be practical.

This paper introduces a new, faster way to solve this "gravity math problem" for a popular astrophysics simulation code called RAMSES. The authors, Jun-Young Lee and Romain Teyssier, have built a new solver based on the Fast Multipole Method (FMM) and tested it against the code's standard solver, called Multigrid (MG).

Here is the breakdown of what they did and found, using simple analogies:

The Problem: The "Handshake" Bottleneck

In the old way of doing things (direct calculation), if you have $N$ objects, you have to do roughly $N^2$ calculations. If you double the number of stars, the work quadruples. This is too slow for big simulations.

Both the old method (MG) and the new method (FMM) are "smart" shortcuts that reduce the work to just $N$ (linear scaling). This means if you double the stars, you only double the work. But they get there in very different ways.
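To make the difference concrete, here is a small counting sketch. It does no physics; it only counts how much work each approach needs. The function names and the `cost_per_body` constant are made up for illustration.

```python
def direct_pair_count(n: int) -> int:
    """Number of distinct pairs among n bodies (each pair counted once)."""
    return n * (n - 1) // 2


def linear_work(n: int, cost_per_body: int = 1) -> int:
    """Work for an idealized O(N) solver; cost_per_body is a hypothetical constant."""
    return cost_per_body * n


# Doubling n roughly quadruples the pair count but only doubles the linear work.
for n in (1_000, 2_000, 4_000):
    print(n, direct_pair_count(n), linear_work(n))
```

In practice the constant hidden in the linear methods is large, but for big enough simulations the linear curve always wins.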

The Old Way: Multigrid (MG) – The "Relay Race"

Think of the Multigrid solver as a relay race that requires many laps.

The Process: It starts with a rough guess of the gravity, then passes that guess through a series of "sponges" (mathematical steps) that clean up the errors. It goes from fine details to a coarse overview and back again.
The Catch: To get a good answer, it has to run this relay race many times (called "V-cycles") until the errors are small enough.
The Boundary Issue: When the simulation reaches the edge of the box (the edge of the universe being simulated), the old method has to make a guess about what's outside. It uses a "fake" boundary condition (like pretending the edge is a wall). This guess isn't perfect and creates errors near the edges of the simulation.

The New Way: Fast Multipole Method (FMM) – The "One-Trip Delivery"

The new FMM solver is like a highly organized delivery service that only needs to make one trip up and one trip down a hierarchy of neighborhoods.

The Upward Trip (Gathering): Imagine grouping stars into neighborhoods, then neighborhoods into districts, then districts into cities. The algorithm gathers the "mass" of these groups into a single summary (a multipole) for each group. It does this from the smallest groups all the way up to the biggest city.
The Downward Trip (Delivering): Now, it sends the gravity information back down.
- Far Away: If a star is very far away, it doesn't need to know about every single star in a distant city; it just needs the "summary" of that city. The algorithm translates that summary into a local force.
- Close By: If a star is right next to another, the algorithm calculates the exact force between them directly.
The Benefit: It only does this one upward and one downward pass. It doesn't need to run a relay race to converge.
The Boundary Advantage: Because it calculates gravity directly from the actual distribution of matter, it does not need to guess what's outside the box. "Empty space" (vacuum) boundaries come naturally, with no fake walls.
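The "summary of a distant city" idea can be tried out in a few lines. The sketch below is only an illustration, not the RAMSES implementation: it uses the crudest possible summary (total mass placed at the center of mass), a made-up clump of 1000 equal-mass stars, and units where G = 1. The real solver keeps more terms than this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 unit-mass "stars" in a unit-size clump at the origin,
# seen from a point 50 clump sizes away. Units are chosen so that G = 1.
positions = rng.uniform(-0.5, 0.5, size=(1000, 3))
masses = np.ones(1000)
target = np.array([50.0, 0.0, 0.0])


def exact_potential(target, positions, masses):
    """Direct sum: one term per star."""
    distances = np.linalg.norm(positions - target, axis=1)
    return -np.sum(masses / distances)


def summary_potential(target, positions, masses):
    """Replace the whole clump by its total mass sitting at its center of mass."""
    total_mass = masses.sum()
    center_of_mass = (masses[:, None] * positions).sum(axis=0) / total_mass
    return -total_mass / np.linalg.norm(target - center_of_mass)


exact = exact_potential(target, positions, masses)
approx = summary_potential(target, positions, masses)
relative_error = abs(approx - exact) / abs(exact)
print(f"relative error of the summary: {relative_error:.2e}")
```

Moving the target closer to the clump makes the summary worse, which is why nearby matter is handled by the exact pairwise calculation instead.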

The Results: Speed vs. Accuracy

The authors ran tests to see how these two methods compared:

For Smooth Things (like gas clouds): Both methods are about equally accurate.
For Sharp Things (like a single point mass): The new FMM method has a slightly "blocky" error pattern. Because it groups things into grids, the math jumps a little at the grid lines, creating a box-shaped error. The old method is smoother here.
For Empty Space: The new FMM method wins. The old method gets messy near the edges of the simulation because of its "fake wall" guesses. FMM handles isolated systems (like a single galaxy in a void) much better.
Speed and Scaling:
- The Math Count: Theoretically, the new FMM method does about 30 times more math operations (floating-point operations) than the old method.
- The Real-World Speed: Surprisingly, they run at almost the same speed on a single computer core. Why? Because the new method does "heavier" math that keeps the computer's brain (CPU) very busy, while the old method spends a lot of time waiting for data to move around.
- The Multi-Core Winner: When many computer cores work together (as MPI ranks), the new FMM method scales much better. The old method gets bogged down because the cores have to talk to each other constantly during its many relay laps. The new method talks less and works more, so adding more cores keeps paying off for longer.

The Verdict

The authors conclude that while the new FMM method does more raw math, it is more efficient because it keeps the computer's processor busy and avoids the communication delays that slow down the old method.

Best for: Simulations of isolated systems (like a single galaxy in a void) where the old method struggles with edge errors.
Best option: They found that a specific setting of the new method (called "FMM-1") is the sweet spot. It is just as accurate as the more complex setting but runs faster.

What's Next?
This paper is the first part of a series. The authors are currently working on adapting this new method to handle Adaptive Mesh Refinement (AMR). This means the simulation can have some areas that are super detailed (zoomed in) and others that are blurry (zoomed out), and the new method will be able to handle the different time steps required for those different zoom levels.

In short: They built a new, one-trip delivery system for gravity that is about as accurate as the old multi-lap race for smooth matter distributions, handles empty space better, and scales up to massive supercomputers more efficiently.

Problem Statement

Accurately and efficiently solving the gravitational interaction in $N$-body and particle-mesh (PM) simulations is critical for modeling structure formation in the universe. While direct summation offers high fidelity, its $O(N^2)$ complexity is prohibitive for large systems. Existing linear-complexity ($O(N)$) solvers, such as Multigrid (MG) methods, are widely used in adaptive mesh refinement (AMR) codes like RAMSES. However, MG solvers are iterative, requiring multiple V-cycles through a grid hierarchy to converge, and often rely on approximate Dirichlet boundary conditions for isolated systems, which can introduce errors near domain boundaries. Conversely, the Fast Multipole Method (FMM) is an $O(N)$ algorithm that performs a single upward and downward pass through a hierarchy, theoretically offering better scalability for isolated boundary conditions, but it has seen limited systematic benchmarking within pure PM or AMR codes compared to direct $N$-body solvers.
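The iterative character of MG can be illustrated with a textbook V-cycle. The sketch below is not the RAMSES solver: it assumes a 1D Poisson problem $-u'' = f$ with zero Dirichlet boundaries, weighted-Jacobi smoothing, full-weighting restriction, and linear interpolation, and all function names are hypothetical. It shows that the residual shrinks a little more with every cycle, so several cycles are needed to reach a given tolerance.

```python
import numpy as np


def residual(u, f, h):
    """r = f - A u for the 1D operator (A u)_i = (-u[i-1] + 2 u[i] - u[i+1]) / h^2."""
    padded = np.pad(u, 1)  # zero Dirichlet values at both ends
    return f - (2 * padded[1:-1] - padded[:-2] - padded[2:]) / h**2


def smooth(u, f, h, sweeps=2, omega=2 / 3):
    """Weighted-Jacobi relaxation: damps the high-frequency part of the error."""
    for _ in range(sweeps):
        u = u + omega * (h**2 / 2) * residual(u, f, h)
    return u


def v_cycle(u, f, h):
    """One V-cycle on a grid with 2^k - 1 interior points."""
    n = u.size
    if n == 1:
        return f * h**2 / 2  # exact solve on the coarsest grid
    u = smooth(u, f, h)
    r = residual(u, f, h)
    # Full-weighting restriction onto the coarse grid (odd fine-grid indices).
    rc = 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
    # Linear interpolation of the coarse correction back to the fine grid.
    e = np.zeros(n)
    e[1::2] = ec
    padded = np.pad(ec, 1)
    e[0::2] = 0.5 * (padded[:-1] + padded[1:])
    return smooth(u + e, f, h)


n = 2**7 - 1
h = 1.0 / (n + 1)
x = np.linspace(h, 1 - h, n)
f = np.pi**2 * np.sin(np.pi * x)  # exact continuous solution: u = sin(pi x)
u = np.zeros(n)
history = []
for cycle in range(5):
    u = v_cycle(u, f, h)
    history.append(np.linalg.norm(residual(u, f, h)))
    print(cycle + 1, history[-1])
```

In a distributed-memory run, every smoothing sweep and grid transfer in each of these cycles implies an exchange of boundary data between ranks, which is the communication cost the single-pass FMM avoids.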

Methodology

The authors implemented a scalable FMM solver within the RAMSES code, specifically designed for unigrid configurations with isolated (vacuum) boundary conditions. The implementation constructs a secondary hierarchy of FMM grids on top of the existing Cartesian grid used for hydrodynamics.

Key Algorithmic Components:

Hierarchy Construction: An FMM hierarchy is built with a configurable level offset ($\Delta\ell$) relative to the finest AMR grid. The coarsest FMM grid fills the computational domain.
Upward Pass (Multipole Accumulation):
- P2M (Particle-to-Multipole): Masses from leaf cells (deposited via Cloud-in-Cell or TSC schemes) are converted into multipole moments.
- M2M (Multipole-to-Multipole): Multipoles are aggregated from leaf cells up to the root. The implementation retains terms up to the quadrupole order ($n=2$), requiring 10 elements per cell in 3D.
- Shifting: Multipoles are shifted from the global origin to the center of each FMM cell to maintain a fixed interaction geometry, facilitating pre-computation of coefficients.
Interaction List & Field Decomposition: The gravitational field is decomposed into far-field, intermediate-field, and near-field contributions relative to a target cell.
- Far-field: Handled by local expansions propagated from parent cells.
- Intermediate-field: Calculated via Multipole-to-Local (M2L) translations for well-separated cells defined by a rigid interaction list.
- Near-field: Resolved via direct pairwise summation (P2P) at the finest level.
Downward Pass (Local Expansion & Direct Summation):
- M2L: Translates multipole expansions of source cells into local expansions for the target cell (retained up to third order, $p=3$).
- L2L (Local-to-Local): Propagates local expansions from parent to child cells using Taylor expansions.
- L2P & P2P: Evaluates the final potential at cell centers using local expansions for far/intermediate fields and direct summation for the near field. A softened Green's function is used for the direct summation to handle cell self-interaction.
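The upward pass can be sketched with raw Cartesian moments. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes un-normalized (non-traceless) moments, which is consistent with the count of 10 elements (1 monopole + 3 dipole + 6 independent quadrupole components), and it accumulates moments about each child center rather than about the global origin. The names `p2m`, `m2m`, and `independent_components` are hypothetical.

```python
import numpy as np


def p2m(masses, positions, center):
    """Raw Cartesian moments up to quadrupole order about `center`."""
    dx = positions - center
    monopole = masses.sum()
    dipole = (masses[:, None] * dx).sum(axis=0)
    quadrupole = np.einsum("k,ki,kj->ij", masses, dx, dx)
    return monopole, dipole, quadrupole


def m2m(moments, old_center, new_center):
    """Re-express moments taken about old_center as moments about new_center."""
    monopole, dipole, quadrupole = moments
    d = old_center - new_center
    new_dipole = dipole + monopole * d
    new_quadrupole = (quadrupole + np.outer(d, dipole) + np.outer(dipole, d)
                      + monopole * np.outer(d, d))
    return monopole, new_dipole, new_quadrupole


def independent_components(moments):
    """1 monopole + 3 dipole + 6 entries of the symmetric quadrupole tensor."""
    _, dipole, _ = moments
    return 1 + dipole.size + len(np.triu_indices(3)[0])


# Hypothetical parent cell of unit size with 8 children, 5 random masses each.
rng = np.random.default_rng(1)
parent_center = np.zeros(3)
child_centers = np.array([[sx, sy, sz] for sx in (-0.25, 0.25)
                          for sy in (-0.25, 0.25) for sz in (-0.25, 0.25)])
all_masses, all_positions = [], []
total = [0.0, np.zeros(3), np.zeros((3, 3))]
for center in child_centers:
    masses = rng.uniform(0.5, 1.5, size=5)
    positions = center + rng.uniform(-0.25, 0.25, size=(5, 3))
    shifted = m2m(p2m(masses, positions, center), center, parent_center)
    total = [acc + part for acc, part in zip(total, shifted)]
    all_masses.append(masses)
    all_positions.append(positions)

# Reference: moments of all masses computed directly about the parent center.
direct = p2m(np.concatenate(all_masses), np.concatenate(all_positions), parent_center)
print(independent_components(direct))
```

Because the shift is exact at this order, summing the shifted child moments reproduces the moments computed directly about the parent center, which is what lets the hierarchy be filled bottom-up without revisiting the leaf data.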

The authors deliberately chose a rigid interaction geometry (fixed opening angles) rather than adaptive criteria to leverage pre-computed translation kernels and reduce conditional branching, anticipating future GPU acceleration.
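A rigid interaction geometry means the set of source-cell offsets is known in advance. As an illustration, the sketch below enumerates the classic textbook FMM interaction list in 3D: the children of the parent cell and of its 26 neighbors, excluding the target itself and its own 26 neighbors. This is an assumption made for the example; the paper's exact list may be defined differently, and the function name is hypothetical.

```python
from itertools import product


def interaction_offsets(child_parity):
    """Integer offsets (in units of the cell size) from a target cell to the
    cells of its classic interaction list.

    child_parity: tuple of three 0/1 values giving the target's position
    inside its parent along each axis.
    """
    offsets = []
    for parent_shift in product((-1, 0, 1), repeat=3):
        for source_parity in product((0, 1), repeat=3):
            offset = tuple(2 * s + q - p for s, q, p
                           in zip(parent_shift, source_parity, child_parity))
            # Skip the target cell and its 26 adjacent cells (near field).
            if max(abs(o) for o in offset) > 1:
                offsets.append(offset)
    return offsets


# Every child position sees the same number of sources, and only a small,
# fixed set of distinct offsets ever occurs across all child positions.
all_offsets = set()
for parity in product((0, 1), repeat=3):
    offsets = interaction_offsets(parity)
    all_offsets.update(offsets)
    print(parity, len(offsets))
print(len(all_offsets))
```

With a fixed list like this, a translation kernel can be tabulated once per distinct offset and reused for every cell at every level, and the inner loop carries no acceptance test, which is the kind of branch-free structure that suits GPUs.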

Key Contributions

Implementation: The first systematic implementation of an FMM Poisson solver specifically integrated into the RAMSES code framework, distinct from existing libraries or direct $N$-body codes.
Benchmarking: A direct "apples-to-apples" comparison between the FMM solver and the standard MG solver in RAMSES, focusing on accuracy and scaling performance.
Boundary Condition Analysis: Demonstration that FMM is particularly well-suited for isolated systems, avoiding the boundary errors inherent in MG schemes that rely on approximate Dirichlet conditions.
Performance Characterization: Detailed analysis showing that while FMM has a higher theoretical floating-point operation (FLOP) count (approx. 30 times that of MG), its higher arithmetic intensity leads to comparable single-core performance and superior parallel scaling due to reduced MPI communication frequency (single pass vs. multiple V-cycles).

Results

Accuracy:
- For smooth density profiles (e.g., two uniform spheres, NFW halos), FMM achieves accuracy comparable to MG.
- For discrete density fields (e.g., a single point mass), FMM exhibits larger errors and characteristic "boxy" error patterns caused by discontinuities in local expansions across cell boundaries. However, the authors note that for extended density distributions relevant to astrophysics, these errors are less pronounced.
- Boundary Performance: FMM significantly outperforms MG near the boundaries of isolated systems, where MG errors increase due to approximate boundary conditions.
- Parameter Sensitivity: The difference in accuracy between $\Delta\ell=1$ (FMM-1) and $\Delta\ell=2$ (FMM-2) is negligible. FMM-1 is identified as the optimal configuration.
Scalability:
- Strong Scaling: FMM-1 scales better than MG and FMM-2, maintaining power-law behavior up to 128 MPI ranks before saturation.
- Weak Scaling: FMM-1 demonstrates superior efficiency compared to both standard and fully optimized MG solvers.
- Communication Overhead: The single-pass nature of FMM results in fewer MPI communications than the iterative V-cycles of MG, leading to better scalability despite the higher FLOP count. The authors attribute the similar single-core performance to both solvers being memory-bound: in that regime the runtime is set by data movement rather than by arithmetic, so FMM's higher arithmetic intensity offsets its larger operation count.
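The way a 30-fold higher FLOP count can coexist with a similar runtime is easiest to see in a simple roofline model, where attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is only a consistency illustration: the machine parameters, FLOP totals, and intensities are invented for the example and are not measurements from the paper; only the factor of about 30 in FLOP count comes from the text above.

```python
def attainable_flops(peak_flops, bandwidth_bytes, intensity):
    """Roofline model: min(peak compute, memory bandwidth x arithmetic intensity)."""
    return min(peak_flops, bandwidth_bytes * intensity)


def runtime(total_flops, peak_flops, bandwidth_bytes, intensity):
    """Idealized runtime in seconds under the roofline model."""
    return total_flops / attainable_flops(peak_flops, bandwidth_bytes, intensity)


# Hypothetical single core: 50 GFLOP/s peak, 10 GB/s memory bandwidth.
PEAK = 50e9
BANDWIDTH = 10e9

# Hypothetical workloads: FMM does 30x the FLOPs of MG, but with 30x the
# arithmetic intensity (FLOPs per byte moved). Both stay below the compute roof.
mg_time = runtime(total_flops=1e9, peak_flops=PEAK,
                  bandwidth_bytes=BANDWIDTH, intensity=0.1)
fmm_time = runtime(total_flops=30e9, peak_flops=PEAK,
                   bandwidth_bytes=BANDWIDTH, intensity=3.0)
print(mg_time, fmm_time)
```

Under these made-up numbers both solvers move the same number of bytes, so the model predicts the same runtime even though one performs far more arithmetic.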

Significance and Claims

The paper claims that the FMM solver provides a scalable, linear-complexity alternative to MG for the RAMSES code, particularly advantageous for problems with isolated boundary conditions. The authors emphasize that while FMM theoretically requires more operations, its algorithmic structure (high arithmetic intensity, reduced communication) makes it competitive in performance and superior in scalability on modern heterogeneous architectures.

The work serves as a prelude to a future implementation of FMM in full AMR simulations with adaptive time stepping (Lee and Teyssier 2026, in prep.). The authors note that the current unigrid implementation is a necessary step to validate the algorithm before extending it to the more complex, non-uniform grid structures and adaptive time-stepping requirements of full cosmological simulations. They also highlight that the "boxy" error patterns are an intrinsic limitation of the current low-order expansion but can potentially be mitigated by higher-order multipoles or random affine transformations in future work.

A Scalable Fast Multipole Method Poisson Solver for the RAMSES code: I. Unigrid Algorithm