Original authors: Shuvro Chowdhury, Jasper Pieterse, Navid Anjum Aadit, Shaila Niazi, Johan H. Mentink, Kerem Y. Camsari

Published 2026-05-13

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine trying to predict the behavior of a massive crowd of people, where every single person is constantly reacting to their neighbors in complex, invisible ways. In the world of physics, this is what scientists call a "quantum many-body system." Trying to simulate this on a regular computer is like trying to count every grain of sand on a beach while the wind is blowing them around; it's incredibly slow and often impossible for large crowds.

This paper introduces a new way to solve this problem by combining smart software with specialized hardware. Here is the breakdown of their approach using simple analogies:

1. The Problem: The "Traffic Jam" of Simulation

Scientists use a method called "Neural Quantum States" (NQS) to model these quantum crowds. Think of a neural network as a very smart map that predicts how the crowd will behave. However, to update this map, the computer has to run millions of random simulations (like asking the crowd, "What if everyone moved one step left?") to see what happens.

On standard computers (CPUs), this sampling process is a massive traffic jam. The computer spends so much time generating these random scenarios that learning the answer becomes impractically slow as the crowd grows. This is the "bottleneck" the authors wanted to fix.

2. The Solution: A Specialized "Probabilistic" Engine

Instead of asking a general-purpose computer to simulate randomness, the authors built a custom machine using FPGAs (chips that can be reprogrammed to act like specialized hardware).

The Analogy: Imagine a standard computer is a single, very smart librarian trying to organize a library by hand. It's accurate but slow. The authors' Probabilistic Computer is like hiring 2,200 tiny, fast workers (called p-bits) who can all shuffle books simultaneously.
How it works: These p-bits are simple units that flip between two states (like a coin landing on heads or tails) based on their neighbors. Because they are built directly into the hardware, they don't need to "think" about being random; they are random by nature. This allows them to generate the millions of scenarios needed for the simulation almost instantly.

3. The First Breakthrough: Simulating a Giant Crowd

The team used this new hardware to simulate a 2D grid of quantum spins (like a grid of tiny magnets).

The Result: They successfully simulated a grid of 80 by 80 (6,400 spins).
Why it matters: Earlier neural-network simulations of this kind struggled to reach this size because the sampling step took too long. The custom hardware reached it with high accuracy, showing that specialized "probabilistic" chips can handle quantum simulations that are impractically slow on standard computers.

4. The Second Breakthrough: The "Deep" Learning Trick

The authors also wanted to use "deeper" neural networks (stacking more layers of logic) because they are better at capturing complex patterns. However, deep networks usually require a mathematical step called "marginalization," which is like trying to calculate the average height of a crowd by measuring every single person individually; for deep networks it quickly becomes computationally intractable.

The Innovation: They invented a "Dual-Sampling Algorithm."
The Analogy: Instead of trying to measure the whole crowd at once, they fix the people on the outside (the visible layer) and only ask the people in the middle (the hidden layers) to shuffle around. By doing this "conditional sampling," they can figure out the answer without doing the impossible math.
The Result: They successfully trained these deep networks on a single FPGA chip for a system of 30 by 30 (900 spins). They found that these deep networks were actually more efficient, needing fewer "settings" (parameters) to get the same accurate result as simpler, shallower networks.

Summary

In short, the paper claims two main things:

Hardware Speed: By programming FPGA chips to act like a massive army of random coin-flippers, they removed the sampling speed limit that was stopping these quantum simulations from growing larger. Using six linked FPGAs, they simulated a system of 6,400 spins, a size previously out of reach for this type of method.
Smarter Algorithms: They created a new way to train "deep" neural networks for quantum physics that avoids impossible math calculations. This allows for more powerful models that are also more efficient.

The authors conclude that by combining this specialized hardware with their new algorithms, we can now simulate quantum systems that are much larger and more complex than ever before, opening the door to understanding materials and physics that were previously too difficult to study.

Technical Summary: Probabilistic Computers for Neural Quantum States

1. Problem Statement

Accurate classical simulation of quantum many-body systems is a fundamental challenge in condensed matter physics and quantum chemistry. While established methods like Quantum Monte Carlo (QMC) and tensor networks have achieved high precision, they face intrinsic limitations: QMC suffers from sign problems in generic systems, and tensor networks struggle with unfavorable entanglement scaling in two dimensions and near criticality.

Neural Quantum States (NQS), which parameterize many-body wavefunctions using neural networks, offer a scalable alternative. However, the variational Monte Carlo (VMC) training of NQS is bottlenecked by the computational cost of Markov Chain Monte Carlo (MCMC) sampling. As system sizes increase, the time required to estimate observables and stochastic parameter gradients via sampling becomes prohibitive, even for relatively simple architectures like Restricted Boltzmann Machines (RBMs). This bottleneck prevents scaling to the large system sizes (e.g., $>10^3$ spins) necessary for exploring complex quantum phases.
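To make the sampling cost concrete, the following is a minimal sketch of how a VMC energy estimate is built from sampled configurations. It rests on several assumptions that are not taken from the paper: the Hamiltonian convention $H = -J\sum_{\langle ij\rangle}\sigma^z_i\sigma^z_j - \Gamma\sum_i \sigma^x_i$, a real positive RBM amplitude with the hidden units summed out analytically, open boundaries, and the hypothetical helper names `log_psi`, `local_energy`, `square_lattice_bonds`, and `vmc_energy`.

```python
import itertools
import numpy as np

def log_psi(v, a, b, W):
    """Log of a positive RBM amplitude: a.v + sum_j log(2 cosh(b_j + sum_i v_i W_ij))."""
    theta = b + v @ W
    return a @ v + np.sum(np.log(2.0 * np.cosh(theta)))

def local_energy(v, bonds, a, b, W, J=1.0, gamma=1.0):
    """E_loc(v) = -J sum_<ij> v_i v_j - gamma sum_i psi(v with spin i flipped) / psi(v)."""
    diag = -J * sum(v[i] * v[j] for i, j in bonds)
    lp = log_psi(v, a, b, W)
    off = 0.0
    for i in range(len(v)):
        v_flip = v.copy()
        v_flip[i] *= -1
        off += np.exp(log_psi(v_flip, a, b, W) - lp)   # wavefunction ratio
    return diag - gamma * off

def square_lattice_bonds(L):
    """Nearest-neighbour bonds of an L x L lattice (open boundaries, an assumption)."""
    bonds = []
    for x in range(L):
        for y in range(L):
            if x + 1 < L:
                bonds.append((x * L + y, (x + 1) * L + y))
            if y + 1 < L:
                bonds.append((x * L + y, x * L + y + 1))
    return bonds

def vmc_energy(samples, bonds, a, b, W, J=1.0, gamma=1.0):
    """Monte Carlo estimate: average E_loc over configurations drawn from |psi|^2."""
    return float(np.mean([local_energy(v, bonds, a, b, W, J, gamma) for v in samples]))

# Sanity check on a 2 x 2 lattice: with all parameters zero the amplitude is uniform,
# so enumerating every configuration is the same as sampling |psi|^2 exactly.
L = 2
n = L * L
bonds = square_lattice_bonds(L)
a, b, W = np.zeros(n), np.zeros(n), np.zeros((n, n))
samples = [np.array(c, dtype=float) for c in itertools.product([-1, 1], repeat=n)]
e_var = vmc_energy(samples, bonds, a, b, W, J=1.0, gamma=2.0)
print(e_var)   # equals -gamma * n for the uniform state
```

Every optimization step needs many such samples, and each local energy needs one wavefunction ratio per spin, which is why the cost of generating configurations dominates as the lattice grows.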

2. Methodology

The authors propose a hardware-software co-design approach to overcome the sampling bottleneck by mapping sparse Boltzmann machine architectures directly onto probabilistic computing hardware.

A. Probabilistic Hardware Architecture

The core of the methodology is the implementation of a probabilistic computer (p-computer) using Field-Programmable Gate Arrays (FPGAs).

P-bits: The hardware uses probabilistic bits (p-bits), classical stochastic units that fluctuate between logic states $\{-1, +1\}$. Networks of interconnected p-bits naturally draw samples from the Boltzmann distribution that VMC requires.
Sparse Connectivity (FRBM): To avoid the routing congestion and $O(N^2)$ wiring complexity of dense networks, the authors employ a Further Restricted Boltzmann Machine (FRBM). This architecture enforces strictly local connectivity (Euclidean distance $k=2$, corresponding to 13 neighbors per spin), reducing wiring complexity to $O(N)$.
Hybrid Execution: A host CPU handles parameter optimization (using Stochastic Reconfiguration), while the FPGA acts as a high-throughput sampler. The FPGA generates spin configurations via parallel p-bit updates, which are transferred to the CPU for gradient accumulation and parameter updates.
Precision: The FPGA implementation uses 10-bit fixed-point arithmetic to maximize p-bit density and parallelism, while the host CPU uses single-precision floating-point (FP32) for numerical stability in optimization.
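The locality rule and the p-bit update can be illustrated in software. The sketch below is a floating-point NumPy emulation under stated assumptions, not the authors' FPGA design: it assumes one hidden unit per lattice site, periodic wrap-around, and the commonly used p-bit rule $m = \mathrm{sgn}(\tanh(I) - r)$ with $r$ uniform in $[-1, 1)$; the names `local_offsets`, `build_neighbors`, `pbit`, and `gibbs_sweep` are hypothetical. On a square lattice there are 13 offsets with $dx^2 + dy^2 \le 4$ when the aligned site $(0, 0)$ is counted, which matches the neighbor count quoted above.

```python
import numpy as np

def local_offsets(k=2):
    """Lattice offsets (dx, dy) within Euclidean distance k, including (0, 0)."""
    r = range(-k, k + 1)
    return [(dx, dy) for dx in r for dy in r if dx * dx + dy * dy <= k * k]

def build_neighbors(L, k=2):
    """nbrs[i] lists the hidden sites coupled to visible site i.
    Periodic wrap-around is an assumption; it needs L > 2k so neighbours stay distinct."""
    offs = local_offsets(k)
    nbrs = np.empty((L * L, len(offs)), dtype=np.int64)
    for x in range(L):
        for y in range(L):
            nbrs[x * L + y] = [((x + dx) % L) * L + (y + dy) % L for dx, dy in offs]
    return nbrs

def pbit(inputs, rng):
    """p-bit rule m = sgn(tanh(I) - r) with r ~ U(-1, 1), applied element-wise."""
    r = rng.uniform(-1.0, 1.0, size=inputs.shape)
    return np.where(np.tanh(inputs) > r, 1, -1)

def gibbs_sweep(v, W, a, b, nbrs, rng):
    """One sweep: update all hidden p-bits given v, then all visible p-bits given h."""
    hidden_in = b.astype(float).copy()
    np.add.at(hidden_in, nbrs, W * v[:, None])    # I_j = b_j + sum_i W_ij v_i
    h = pbit(hidden_in, rng)
    visible_in = a + (W * h[nbrs]).sum(axis=1)    # I_i = a_i + sum_j W_ij h_j
    return pbit(visible_in, rng), h

L = 6
nbrs = build_neighbors(L)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=nbrs.shape)        # one weight per local coupling: O(N) in total
a = np.zeros(L * L)
b = np.zeros(L * L)
v = pbit(np.zeros(L * L), rng)
for _ in range(100):
    v, h = gibbs_sweep(v, W, a, b, nbrs, rng)
print(len(local_offsets(2)), W.size)
```

Because the visible and hidden layers are only coupled to each other, every unit in a layer can be updated at the same time, which is the property that parallel hardware exploits. The weight array has 13 entries per site, so storage and wiring grow linearly with the number of spins.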

B. Dual-Sampling Algorithm for Deep Models

To enable the training of Deep Boltzmann Machines (DBMs)—which are more expressive than shallow RBMs but suffer from intractable marginalization over hidden units—the authors introduce a dual-sampling algorithm.

Concept: Instead of marginalizing over auxiliary variables (which is computationally expensive), the algorithm replaces this step with conditional sampling.
Process:
1. Outer Loop: Sample visible configurations ($v$) from the physical layer.
2. Inner Loop: For each fixed visible configuration, clamp the visible units and perform Gibbs sampling over the auxiliary (hidden and deep) layers.
3. Estimation: Wavefunction ratios required for local energy calculations are estimated as conditional expectations over the auxiliary variables given the fixed visible state.
Efficiency: This approach decouples physical spin sampling from auxiliary layer sampling, reducing variance and avoiding the need to resample for every single-spin flip. It allows for the training of sparse deep architectures under strict locality constraints.
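The inner loop and the estimation step can be sketched on a toy model. The example assumes that the amplitude is the unnormalized marginal $\psi(v) = \sum_{h,d} e^{-E(v,h,d)}$ of a three-layer Boltzmann machine with $\pm 1$ units, in which case the identity $\psi(v')/\psi(v) = \mathbb{E}_{h,d \mid v}\left[e^{-(E(v',h,d) - E(v,h,d))}\right]$ holds exactly. The energy function, the parameter dictionary `p`, and the helper names are hypothetical, and the paper's own estimator may differ in detail.

```python
import itertools
import numpy as np

def energy(v, h, d, p):
    """Energy of a visible-hidden-deep Boltzmann machine with +/-1 units."""
    return -(p["a"] @ v + p["b"] @ h + p["c"] @ d + v @ p["W1"] @ h + h @ p["W2"] @ d)

def pbit(inputs, rng):
    """p-bit rule m = sgn(tanh(I) - r) with r ~ U(-1, 1)."""
    r = rng.uniform(-1.0, 1.0, size=inputs.shape)
    return np.where(np.tanh(inputs) > r, 1.0, -1.0)

def dual_sampling_ratio(v, v_new, p, rng, n_sweeps=20000, burn_in=500):
    """Estimate psi(v_new) / psi(v) by Gibbs sampling (h, d) with v clamped."""
    h = pbit(np.zeros_like(p["b"]), rng)
    d = pbit(np.zeros_like(p["c"]), rng)
    total = 0.0
    for t in range(burn_in + n_sweeps):
        h = pbit(p["b"] + p["W1"].T @ v + p["W2"] @ d, rng)   # hidden given v and d
        d = pbit(p["c"] + p["W2"].T @ h, rng)                 # deep given hidden
        if t >= burn_in:
            total += np.exp(-(energy(v_new, h, d, p) - energy(v, h, d, p)))
    return total / n_sweeps

def exact_ratio(v, v_new, p):
    """Brute-force marginalization, feasible only for a handful of auxiliary units."""
    def psi(x):
        return sum(np.exp(-energy(x, np.array(h, float), np.array(d, float), p))
                   for h in itertools.product([-1, 1], repeat=len(p["b"]))
                   for d in itertools.product([-1, 1], repeat=len(p["c"])))
    return psi(v_new) / psi(v)

rng = np.random.default_rng(0)
p = {"a": rng.normal(scale=0.2, size=3), "b": rng.normal(scale=0.2, size=3),
     "c": rng.normal(scale=0.2, size=2), "W1": rng.normal(scale=0.2, size=(3, 3)),
     "W2": rng.normal(scale=0.2, size=(3, 2))}
v = np.array([1.0, -1.0, 1.0])
v_new = v.copy()
v_new[0] *= -1                                   # single-spin flip
est = dual_sampling_ratio(v, v_new, p, rng)
ref = exact_ratio(v, v_new, p)
print(est, ref)
```

One set of auxiliary samples drawn for a clamped $v$ can be reused for every candidate flip $v'$, since only the exponent changes. The brute-force reference scales exponentially in the number of auxiliary units and is included only to check the estimator on this toy model.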

C. Scalability Strategy

Multi-FPGA Clustering: For large systems (e.g., $80 \times 80$ lattices), the FRBM graph is partitioned across multiple FPGAs using the METIS graph partitioning tool. Boundary p-bits are exchanged asynchronously over high-speed FMC links, while local p-bits update synchronously. This allows the system to scale beyond the resources of a single chip.
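The quantity that a partitioner tries to minimize here is the cut fraction, the share of couplings whose two endpoints land on different devices. The sketch below computes it for the local $k=2$ coupling pattern; it uses a naive strip partition as a stand-in, not METIS, and assumes open boundaries and that the visible and hidden units of a site sit on the same device. The function names are hypothetical, and the resulting numbers are not expected to match the values reported in the paper.

```python
import numpy as np

def strip_partition(L, n_parts):
    """Assign lattice site (x, y) to a device by vertical strips; part depends on x only."""
    cols = (np.arange(L) * n_parts) // L
    return np.repeat(cols[:, None], L, axis=1)

def cut_fraction(L, part_of_site, k=2):
    """Fraction of local couplings (distance <= k) that cross a device boundary."""
    offs = [(dx, dy) for dx in range(-k, k + 1) for dy in range(-k, k + 1)
            if dx * dx + dy * dy <= k * k]
    total = 0
    cut = 0
    for x in range(L):
        for y in range(L):
            for dx, dy in offs:
                xn, yn = x + dx, y + dy
                if 0 <= xn < L and 0 <= yn < L:      # open boundaries (assumption)
                    total += 1
                    cut += int(part_of_site[x, y] != part_of_site[xn, yn])
    return cut / total

L = 80
parts = strip_partition(L, 6)                        # six devices, as in the cluster described
print(cut_fraction(L, parts))
```

Only the couplings counted in the cut need to travel over inter-device links, so a small cut fraction keeps most p-bit updates local to one chip.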

3. Key Contributions

Hardware-Accelerated Sampling: The authors demonstrate the mapping of sparse Boltzmann machines onto a multi-FPGA cluster, achieving massive sampling speedups compared to CPU and GPU baselines.
Dual-Sampling Algorithm: They introduce a novel algorithm that makes the training of sparse Deep Boltzmann Machines feasible for variational Monte Carlo by replacing intractable marginalization with conditional sampling.
Parameter Efficiency: They demonstrate that sparse deep architectures (DBMs) reach lower variational energies than sparse shallow networks (RBMs) while using fewer than half as many parameters.

4. Results

The methodology was validated on the two-dimensional transverse-field Ising model (TFIM) at criticality.

Single-FPGA Performance:
- For a $35 \times 35$ lattice (1,225 spins), the system reached chemical accuracy (relative error $|\Delta E/E_{ref}| \le 1.6 \times 10^{-3}$) within $\approx 100$ optimization iterations.
- Sampling consumed less than 5% of the total wall-clock time on the FPGA, whereas a CPU baseline spent 20–30% of its time on sampling even with significantly fewer samples.
- Ground-state energies interpolated smoothly between ferromagnetic and field-polarized limits, matching Continuous-Time Path Integral Monte Carlo benchmarks.
Multi-FPGA Scaling:
- Using a cluster of six interconnected FPGAs, the authors simulated lattices up to $80 \times 80$ (6,400 spins).
- The system maintained convergence within chemical accuracy as system size increased, with boundary communication overhead minimized (cut fractions of 5.6% for $L=80$).
- Asynchronous communication allowed local p-bits to be overclocked to 15 MHz, well above the clock rate that strict global synchronization would allow.
Deep Model Training:
- On a $10 \times 10$ lattice, the dual-sampling algorithm successfully trained a sparse DBM, achieving chemical accuracy.
- Parameter Efficiency: The sparse DBM reached lower variational energies with fewer than half the parameters ($N_p \approx 1300$) that a sparse RBM needed ($N_p \approx 3100$) to reach similar accuracy.
- Scalability: The algorithm was successfully applied to a $30 \times 30$ lattice (900 spins) on a single FPGA, demonstrating the feasibility of training deep models for systems previously difficult to handle with deep NQS.
- Algorithmic scaling analysis on a GPU showed that iteration time scales quadratically with the linear dimension ($t_{iter} \propto L^2$) under fixed sparsity, that is, linearly with the total number of spins $N=L^2$.

5. Significance and Claims

The paper claims that probabilistic hardware effectively alleviates the sampling bottleneck in the variational simulation of quantum many-body systems. By combining sparse Boltzmann machine architectures with p-bit hardware, the authors demonstrate:

Scalability: The ability to simulate quantum systems with up to 6,400 spins, surpassing the limits of current CPU- and GPU-based NQS implementations.
Architectural Depth: The introduction of dual sampling enables the training of deep, sparse models, which offer better parameter efficiency and the capacity to represent complex correlations (such as volume-law entanglement) that shallow networks cannot.
Future Path: The work positions probabilistic computing as a scalable route for classically simulating quantum matter. The authors suggest that as p-bit architectures mature from FPGA prototypes to dedicated CMOS circuits, further integration of sampling, local energy evaluation, and gradient accumulation on a single die could reduce latency and energy consumption by orders of magnitude, making VMC practical for quantum systems far larger than those accessible today.

The authors remain modest regarding non-stoquastic systems, noting that extending the approach to systems with non-trivial sign structures would require complex parameters or phase networks, which is beyond the current scope. Similarly, while the sampling bottleneck is addressed, the overall training cost remains linear in system size due to host-based stochastic reconfiguration updates, which they identify as a target for future hardware acceleration.

Probabilistic Computers for Neural Quantum States