Published 2026-05-18

Original authors: Bruno Golosio, Gianmarco Tiddia, José Villamar, Luca Pontisso, Luca Sergi, Francesco Simula, Pooja Babu, Elena Pastorelli, Abigail Morrison, Markus Diesmann, Alessandro Lonardo, Pier Stanislao Paolucci, Johanna Senk

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine trying to simulate the human brain on a computer. The brain is a massive city of about 86 billion neurons, where each neuron is a house sending tiny electrical "text messages" (called spikes) to thousands of other houses every second. To simulate this, you need a supercomputer with thousands of graphics cards (GPUs) working together.

The problem is that these GPUs are like islands. They are fast, but they don't talk to each other easily. If one island wants to send a message to another, the "mailman" (the communication system) has to run back and forth, which slows everything down.

This paper introduces a new, much faster way to build the map of these connections before the simulation starts, so the GPUs can run the simulation without getting stuck in traffic.

Here is how they did it, explained simply:

1. The Old Way: Building the Map on the Mainland

Previously, when scientists wanted to simulate a brain network, they built the "connection map" on the slow, central computer (the CPU) first. Then, they had to copy this massive map over to the fast GPUs.

The Analogy: Imagine you are organizing a massive party. In the old method, you wrote down every single guest's name and who they know on a piece of paper in the kitchen (CPU), then ran to every single room (GPU) to hand them a copy of the list. This took a long time just to get ready.

2. The New Way: Building the Map Inside the Rooms

The authors developed a new method where each GPU builds its own part of the connection map directly inside its own memory, without waiting for the central computer.

The Analogy: Now, instead of writing the list in the kitchen, every room has its own notepad. Before the party starts, the guests in each room write down who they know right there. No running back and forth to the kitchen is needed.

The Result: This "onboard" construction is more than 12 times faster than the old way. In one test, building the network took about 55 seconds instead of about 11.5 minutes (686 seconds).

3. Two Ways to Send Messages

Once the map is built, the GPUs need to exchange the "text messages" (spikes) during the simulation. The paper tested two different strategies for this, depending on how the network is organized:

Strategy A: The Direct Phone Call (Point-to-Point)
- How it works: If a neuron in GPU #1 needs to talk to a specific neuron in GPU #2, it calls that specific GPU directly.
- Best for: Networks where connections are uneven or specific (like a real brain where some areas talk a lot to each other, but not to everyone).
- The Paper's Claim: They used this for a model of the monkey's visual cortex (32 different areas). The simulated activity matched the statistics produced by earlier versions of the simulator, showing that the new map-building method is compatible with complex, realistic brain structures.

Strategy B: The Group Chat (Collective Communication)
- How it works: Instead of calling individuals, a GPU shouts its messages to a whole group of GPUs at once. Everyone in the group hears the shout and checks if the message is for them.
- Best for: Huge, random networks where everyone talks to everyone (like a balanced crowd).
- The Paper's Claim: They tested this on a "balanced network" that grows along with the number of GPUs. At the largest size, 1,024 GPUs worked together on a network of about 230 million neurons and 2.59 trillion synapses, and the system kept scaling as more cards were added.

4. The "Memory Levels" Trick

GPUs have a lot of memory, but not infinite. Storing the connection maps for billions of neurons takes up a lot of space.

The Analogy: Imagine you have a small desk (GPU memory) and a huge warehouse (CPU memory).
The Solution: The authors created four "levels" of organization.
- Level 0: Keep the maps in the warehouse (CPU) and only bring what you need to the desk. This saves desk space but is slower to fetch.
- Level 3: Fill the desk with everything. This is the fastest but requires a bigger desk.
The Paper's Claim: They showed that choosing the right level is a trade-off on the Leonardo Booster supercomputer: the most memory-saving level leaves room for larger networks on each GPU, while the higher levels run faster but fill the GPU memory sooner. They also extrapolate that the upcoming JUPITER supercomputer could simulate a network with about 20 billion neurons and 100 trillion synapses. That approaches the scale of the human cortex!

Summary of What They Achieved

Speed: They made the "setup" phase of brain simulations roughly 12x faster by building the network map directly on the graphics cards.
Scale: They demonstrated that this works on up to 1,024 GPUs simultaneously.
Flexibility: They showed two different ways to handle communication (direct calls vs. group chats) so scientists can choose the best method for their specific brain model.
Future Proof: Their methods are designed to work on the next generation of "Exascale" supercomputers, which the authors expect to handle networks approaching the scale of the human cortex while still tracking every individual synapse.

In short, they didn't just make the simulation run faster; they built a better "road system" for the data so the supercomputer doesn't get stuck in traffic before the race even begins.

Technical Summary: Scalable Construction of Spiking Neural Networks using up to thousands of GPUs

Problem Statement

Simulating large-scale Spiking Neural Networks (SNNs) at the scale of the human cerebral cortex presents two primary challenges: substantial memory requirements for individual neurons and synapses, and the need for high processing speeds to resolve dynamics with sub-millisecond precision. While High-Performance Computing (HPC) systems equipped with thousands of GPUs offer the necessary computational density, existing GPU-based simulation software has not yet demonstrated the ability to scale to entire compute clusters while meeting the infrastructure and accuracy demands of computational neuroscience.

A specific bottleneck in distributed simulations of large point-neuron networks is the communication of spikes between different nodes of a compute cluster. Previous approaches, such as the Digital Brain or GeNN, either omit individual synapse information or are limited to single-GPU execution. Furthermore, traditional CPU-based simulators like NEST rely on round-robin neuron distribution and collective communication, which assume homogeneous network structures and fail to exploit the topological and spatial heterogeneity of biological brains. While NEST GPU has addressed some of these issues, its initial network construction relied on transferring data from CPU to GPU memory, and dynamic construction methods were previously limited to single-GPU simulations.

Methodology

This work presents a novel, memory-efficient method for constructing and simulating large-scale SNNs directly on multi-GPU systems using the Message Passing Interface (MPI). The core innovation lies in performing network construction entirely within GPU memory ("onboard") without inter-process communication during the construction phase.

Core Algorithm

The method distinguishes between local connections (neurons within the same MPI process) and remote connections (neurons across different processes).

Independent Construction: Each MPI process independently builds its portion of the network. It creates local connectivity and prepares data structures for remote connections without communicating with other processes.
Proxy Representations: For remote connections, the method uses "image neurons" (proxies) in target processes. These are virtual representations of source neurons located in other MPI ranks.
Communication Maps: The algorithm instantiates contiguous communication maps in GPU memory to route spikes efficiently. These maps associate the index of a source neuron in a source rank with the index of its image neuron in a target rank.
Communication Schemes: The framework supports two MPI communication modes, selectable by the user based on network architecture:
- Point-to-Point: Uses direct communication between two processes. It is optimized for networks with uneven distributions of neurons or synapses (e.g., the Multi-Area Model). It utilizes specific mapping structures $(R_{\tau,\sigma}, L_{\tau,\sigma})$ and sequences $(T, P)$ to route spikes.
- Collective: Uses group-based communication (e.g., MPI_Allgather). This is advantageous for balanced networks with homogeneous communication payloads. It employs group-specific indexing arrays and host arrays to manage spike routing across multiple processes simultaneously.
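
The routing idea behind the communication maps and the two schemes can be sketched in plain Python. This is a minimal single-process illustration, not the NEST GPU implementation: the class `CommMap` and the functions `route_point_to_point` and `route_collective` are hypothetical names, the paper's structures $(R_{\tau,\sigma}, L_{\tau,\sigma})$ and $(T, P)$ are not reproduced, and dictionaries stand in for MPI messages. The sketch assumes that the position of a source neuron in a sorted map can serve as the index of its image neuron.

```python
from bisect import bisect_left


class CommMap:
    """Hypothetical map for one (source rank, target rank) pair.

    Stores the sorted ids of source neurons that have an image neuron on
    the target rank. In this sketch the position in the sorted list is
    used as the image-neuron index.
    """

    def __init__(self, source_ids):
        self.source_ids = sorted(set(source_ids))

    def image_index(self, source_id):
        # Binary search in the contiguous sorted array.
        pos = bisect_left(self.source_ids, source_id)
        if pos < len(self.source_ids) and self.source_ids[pos] == source_id:
            return pos
        return None  # this neuron has no image on the target rank


def route_point_to_point(spikes, maps):
    """Send spikes only along (source, target) pairs that have a map.

    spikes: {source_rank: [ids of neurons that fired]}
    maps:   {(source_rank, target_rank): CommMap}
    Returns {target_rank: [(source_rank, image_index), ...]}.
    """
    delivered = {}
    for (src, tgt), cmap in maps.items():
        for neuron_id in spikes.get(src, []):
            img = cmap.image_index(neuron_id)
            if img is not None:
                delivered.setdefault(tgt, []).append((src, img))
    return delivered


def route_collective(spikes, maps, ranks):
    """Allgather-like scheme: every rank sees every spike, then filters."""
    gathered = [(src, nid) for src in sorted(spikes) for nid in spikes[src]]
    delivered = {}
    for tgt in ranks:
        for src, neuron_id in gathered:
            cmap = maps.get((src, tgt))
            if cmap is None:
                continue
            img = cmap.image_index(neuron_id)
            if img is not None:
                delivered.setdefault(tgt, []).append((src, img))
    return delivered


# Toy setup: three ranks, neuron 7 on rank 0 has no remote targets.
maps = {
    (0, 1): CommMap([5, 2, 9]),
    (0, 2): CommMap([9]),
    (1, 0): CommMap([3]),
}
spikes = {0: [2, 9, 7], 1: [3]}
p2p = route_point_to_point(spikes, maps)
coll = route_collective(spikes, maps, ranks=[0, 1, 2])
```

Both schemes deliver the same spikes to the same image neurons. They differ in how much data travels: the point-to-point variant touches only rank pairs that share connections, while the collective variant distributes every spike to every rank in the group and leaves the filtering to the receiver.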

GPU Memory Optimization

To balance GPU memory consumption and simulation speed, the authors implemented four GPU Memory Levels (GMLs):

Level 0: Remote connection maps and connection counts are stored in CPU memory.
Level 1: Similar to Level 0 but assumes all source neurons have images in target processes, avoiding checks for actual usage (faster construction, potentially higher memory waste).
Level 2: Maps and connection indices are stored in GPU memory; connection counts are computed on the fly. This is the default level.
Level 3: All data structures, including connection counts, are stored in GPU memory, minimizing CPU-GPU data transfer at the cost of higher GPU memory usage.
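
The four levels can be summarized as a small lookup table. The sketch below only restates the descriptions above in code form; the names `GpuMemoryLevel`, `LEVELS`, and `gpu_resident` are hypothetical and do not correspond to the NEST GPU interface, and placing the Level 1 maps in CPU memory follows from reading "similar to Level 0" literally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuMemoryLevel:
    level: int
    remote_maps: str        # "cpu" or "gpu": where the remote connection maps live
    connection_counts: str  # "cpu", "on_the_fly", or "gpu"
    assume_all_sources_imaged: bool  # skip the check for actually used sources


# Restatement of the four levels described in the text (illustrative only).
LEVELS = {
    0: GpuMemoryLevel(0, "cpu", "cpu", False),
    1: GpuMemoryLevel(1, "cpu", "cpu", True),
    2: GpuMemoryLevel(2, "gpu", "on_the_fly", False),
    3: GpuMemoryLevel(3, "gpu", "gpu", False),
}
DEFAULT_LEVEL = 2


def gpu_resident(level):
    """Return the structures this sketch keeps in GPU memory at a level."""
    cfg = LEVELS[level]
    resident = []
    if cfg.remote_maps == "gpu":
        resident.append("remote_maps")
    if cfg.connection_counts == "gpu":
        resident.append("connection_counts")
    return resident
```

Read top to bottom, the table shows the trade-off: each step up moves more data into GPU memory, which reduces CPU-GPU traffic and leaves less room for neurons and synapses.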

Models Evaluated

Multi-Area Model (MAM): A biologically detailed model of 32 vision-related areas of the macaque monkey cortex ($4.13 \times 10^6$ neurons, $24.2 \times 10^9$ synapses). This model features complex, hierarchical connectivity and was simulated using point-to-point communication.
Scalable Balanced Network: A random network of excitatory and inhibitory neurons with fixed in-degree connectivity, designed to assess weak scaling performance. This model was simulated using collective communication on up to 1,024 GPUs.
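
As a rough illustration of what "fixed in-degree" and "weak scaling" mean, the sketch below draws the same number of incoming connections for every target neuron and grows the network in proportion to the number of GPUs. The function names are hypothetical, the sketch allows repeated source-target pairs, and it runs on the CPU with Python's standard random module rather than on a GPU.

```python
import random


def fixed_indegree(n_sources, n_targets, indegree, seed):
    """Return, for each target neuron, a list of `indegree` random source ids.

    Sources are drawn with replacement, so a source may appear more than
    once for the same target (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [
        [rng.randrange(n_sources) for _ in range(indegree)]
        for _ in range(n_targets)
    ]


def weak_scaling_size(neurons_per_gpu, n_gpus):
    """Weak scaling: the per-GPU load is fixed, so the network grows with n_gpus."""
    return neurons_per_gpu * n_gpus


# Tiny example: 100 neurons, every neuron receives exactly 10 inputs.
connections = fixed_indegree(n_sources=100, n_targets=100, indegree=10, seed=1234)
```

Because every neuron receives the same number of inputs regardless of network size, the work per GPU stays roughly constant as GPUs are added, which is what makes the model suitable for weak-scaling measurements.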

Key Results

Network Construction Performance

The "onboard" GPU construction method demonstrated significant speedups compared to the previous "offboard" (CPU-based) approach:

MAM Simulation: Network construction time decreased from 686.0 s (offboard) to 55.5 s (onboard), a 12.4x speedup.
- Local connection creation saw a 20x speedup.
- Remote connection creation saw a 9x speedup.
- Neuron/device creation and simulation preparation saw speedups of 350x and 50x, respectively.
Scalable Balanced Network: The method successfully constructed networks up to 230.4 million neurons and 2.59 trillion synapses across 1,024 GPUs (256 nodes).
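
The headline figures can be checked with a few lines of arithmetic. The values below are taken from the results above; the per-GPU and per-neuron numbers are derived here and are not quoted from the paper.

```python
# Construction times for the Multi-Area Model, in seconds.
offboard_s = 686.0
onboard_s = 55.5
speedup = offboard_s / onboard_s  # about 12.4

# Largest balanced network reported.
neurons = 230_400_000
synapses = 2.59e12
gpus = 1024
nodes = 256

gpus_per_node = gpus // nodes              # 4
neurons_per_gpu = neurons // gpus          # 225000
synapses_per_neuron = synapses / neurons   # roughly 11 thousand

print(f"speedup: {speedup:.1f}x")
print(f"neurons per GPU: {neurons_per_gpu}")
print(f"synapses per neuron: {synapses_per_neuron:.0f}")
```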

State Propagation and Scaling

MAM: The state propagation time, measured as the Real-Time Factor (wall-clock time divided by simulated biological time), remained comparable between offboard and onboard versions (approx. 15–16), indicating that the construction optimization did not slow down the simulation itself.
Balanced Network: The system demonstrated weak scaling up to 1,024 GPUs.
- Memory Efficiency: GPU Memory Level 0 allowed simulations to reach 4,096 nodes without exceeding the memory limits of NVIDIA A100 GPUs (64 GB). Higher memory levels (2 and 3) offered faster construction and simulation speeds but reached the memory limit at lower node counts (approx. 3,072 nodes for Level 3).
- Performance: Disabling spike recording in the balanced network reduced state propagation time by approximately 20%.
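
For readers unfamiliar with the Real-Time Factor, the sketch below uses its common definition, wall-clock time divided by simulated biological time, and applies the approximate 20% reduction mentioned above. The function name and the 10-second example are illustrative assumptions, not measurements from the paper.

```python
def real_time_factor(wall_clock_s, biological_s):
    """Wall-clock seconds needed per second of simulated biological time."""
    if biological_s <= 0:
        raise ValueError("biological_s must be positive")
    return wall_clock_s / biological_s


# Hypothetical run: 10 s of model time computed in 150 s of wall-clock time.
rtf = real_time_factor(wall_clock_s=150.0, biological_s=10.0)  # 15.0

# A reduction of about 20% in state propagation time scales the
# wall-clock time by 0.8 for the same amount of model time.
wall_clock_without_recording = 150.0 * 0.8  # 120.0 s
rtf_without_recording = real_time_factor(wall_clock_without_recording, 10.0)
```

A Real-Time Factor of 1 would mean the simulation keeps pace with biology; values above 1 mean it runs slower than real time.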

Validation

The new construction method was validated against the previous offboard version and the CPU-based NEST simulator. Because the new algorithm changes the sequence of random numbers, individual spike times differ between versions, so the comparison is statistical: the distributions of firing rates, of the coefficient of variation of inter-spike intervals, and of pairwise Pearson correlations were preserved, indicating that the new construction method does not alter the network dynamics.
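
The three statistics can be computed from recorded spike times with standard formulas. The sketch below is a minimal pure-Python version for one or two neurons; the function names, the millisecond units, and the bin width are assumptions for illustration, and the paper's actual analysis pipeline is not reproduced here.

```python
import math


def firing_rate(spike_times_ms, duration_ms):
    """Mean firing rate in spikes per second."""
    return len(spike_times_ms) / (duration_ms / 1000.0)


def cv_isi(spike_times_ms):
    """Coefficient of variation of inter-spike intervals (std / mean)."""
    times = sorted(spike_times_ms)
    isis = [b - a for a, b in zip(times, times[1:])]
    if len(isis) < 2:
        raise ValueError("need at least three spikes")
    mean = sum(isis) / len(isis)
    var = sum((x - mean) ** 2 for x in isis) / len(isis)
    return math.sqrt(var) / mean


def binned_counts(spike_times_ms, duration_ms, bin_ms):
    """Spike counts in consecutive bins of width bin_ms."""
    n_bins = int(duration_ms // bin_ms)
    counts = [0] * n_bins
    for t in spike_times_ms:
        idx = int(t // bin_ms)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


# A perfectly regular neuron: one spike every 10 ms for 100 ms.
regular = [10.0 * k for k in range(1, 11)]
rate = firing_rate(regular, duration_ms=100.0)
cv = cv_isi(regular)
counts = binned_counts([1.0, 2.0, 12.0, 25.0], duration_ms=30.0, bin_ms=10.0)
```

In a validation of this kind, such values are collected over many neurons and neuron pairs, and the resulting distributions are compared between simulator versions rather than spike by spike.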

Significance and Claims

The paper claims that this work provides the first GPU-based SNN simulation software capable of scaling to entire compute clusters (up to thousands of GPUs) while storing individual synapse information. The primary contributions are:

Scalable Construction: A novel algorithm that builds network connectivity directly in GPU memory, eliminating the CPU-GPU transfer bottleneck and avoiding MPI communication during the construction phase.
Flexibility: Support for both point-to-point and collective MPI communication, allowing adaptation to different network topologies (hierarchical vs. random/balanced).
Exascale Readiness: The authors extrapolate that their approach could simulate networks of $2 \times 10^{10}$ neurons and $10^{14}$ synapses on the upcoming JUPITER exascale supercomputer. This scale approaches the connectivity of the human cortex while maintaining individual synapse resolution.
Efficiency: By optimizing memory usage through the GML system, the method enables the simulation of larger networks on existing hardware (e.g., fitting the MAM on 8 GPUs instead of 32) and provides a pathway to utilize the full capacity of future exascale systems.

The authors conclude that this approach addresses the critical bottleneck of spike communication in distributed simulations and establishes NEST GPU as a reference platform for large-scale, biologically detailed neural simulations on modern HPC architectures.