WF-Bench: A Benchmark for Neural Network WaveFunction… — Plain-Language Explanation

Imagine you are trying to teach a robot to paint a perfect picture of a complex quantum world. In the world of physics, these "pictures" are called wavefunctions. They describe how tiny particles like electrons dance, interact, and arrange themselves. For a long time, scientists have used Neural Networks (a type of AI) to try and guess what these pictures look like.

However, there was a problem: everyone was using different test pictures, different painting styles, and different ways to grade the work. It was impossible to tell if one AI was truly better than another, or if it just happened to be good at a specific type of picture.

This paper introduces WF-Bench, a solution to that problem. Think of WF-Bench as a universal "driving test" for these AI painters.

The "Driving Test" (The Dataset)

Just as a driving test checks if you can handle a rainy highway, a snowy mountain, and a busy city, WF-Bench tests AI wavefunctions on three very different types of "quantum terrain":

Topological States (The Twisted Knots): Imagine a piece of string tied in incredibly complex, knotted patterns that can't be untangled without cutting. These represent exotic states of matter where particles have a "twisted" relationship.
Superconductors (The Perfect Dance): Imagine a ballroom where every dancer moves in perfect, synchronized pairs. These are materials where electricity flows with zero resistance.
Wigner Crystals (The Frozen Grid): Imagine a crowd of people who, because they are so annoyed by each other, stand perfectly still in a rigid grid pattern. This happens when electrons repel each other so strongly they freeze in place.

The dataset contains 31 different "target pictures" from these three categories. Some are simple, while others are incredibly complex with strange phases and patterns.

The "Grading System" (The Protocol)

To see how well an AI paints, the researchers use a metric called Fidelity.

The Analogy: Imagine the AI is a student taking a test. The "Target Wavefunction" is the answer key. Fidelity is the percentage of the answer key the student gets right.
The Challenge: As the number of electrons (the "students" in the room) increases, the test gets steadily harder. The paper found that for both AI models tested, the "score" (fidelity) drops as the system gets bigger, following a predictable mathematical pattern (a power law).

The "Paintbrushes" (The Architectures)

The researchers tested two popular AI "paintbrushes" (architectures) on this test:

Ferminet: A model that looks at both individual electrons and how pairs of electrons interact.
Psiformer: A model that uses a "self-attention" mechanism (similar to how modern AI like ChatGPT works) to look at the whole group of electrons at once.

The Result: When given the same amount of "brainpower" (number of parameters), Psiformer consistently painted a better picture than Ferminet. It got higher scores across almost every test, especially on the most complex, twisted "Topological" knots.

The "Diminishing Returns" (Scaling Laws)

The paper also looked at how adding more "tools" to the AI affects its performance:

More Determinants (More Brushes): Adding more "determinants" (mathematical building blocks) helps the AI improve quickly at first. But after a certain point (around 32), adding more brushes doesn't make the picture much better. It's like owning 100 paintbrushes when about 32 already cover every stroke you need; the extra ones just add weight without adding color.
More Layers (Deeper Thinking): Making the AI "deeper" (adding more layers of processing) helps a lot when going from 1 layer to 2. But going from 2 layers to 10 doesn't help much. The AI hits a "ceiling" where it can't learn much more from just being deeper.

The Bottom Line

This paper didn't just build a dataset; it built a standardized ruler.

It found that Psiformer is currently a stronger "painter" than Ferminet for these tasks.
It showed that bigger isn't always better: Adding too many tools or making the AI too deep doesn't guarantee a better picture.
It established that complexity grows fast: As the number of particles increases, it becomes mathematically harder for any AI to capture the perfect picture, but WF-Bench now gives scientists a way to measure exactly how hard it is for different models.

In short, WF-Bench is the tool that allows scientists to stop guessing which AI is best and start measuring it fairly, ensuring that future quantum simulations are built on solid, comparable ground.

Technical Summary: WF-Bench

Problem Statement
Neural network (NN) wavefunctions have emerged as powerful variational ansätze for solving quantum many-body problems, demonstrating scalability across tasks ranging from ground-state optimization to real-time dynamics. However, despite rapid architectural advancements (e.g., Ferminet, Psiformer, graph neural networks), the field lacks a systematic understanding of how representational power varies across different physical systems and model architectures. Specifically, there is no unified framework to evaluate NN wavefunction expressivity or to characterize empirical scaling laws regarding system size and model capacity. Existing studies often focus on specific regimes or models, leaving a gap in comprehensive, reproducible benchmarking.

Methodology
To address this, the authors introduce WF-Bench, a comprehensive benchmarking dataset and protocol designed to evaluate NN wavefunction expressivity.

Dataset Composition: WF-Bench comprises over 30 target wavefunctions spanning three distinct classes of strongly correlated quantum matter:
1. Topological States: Includes Laughlin and Moore-Read states (fractional quantum Hall systems) with varying filling factors and quasihole excitations. These feature nontrivial topological order and complex phase structures.
2. Superconducting States: A family of Bardeen-Cooper-Schrieffer (BCS) wavefunctions with diverse pairing symmetries (s-, p-, d-, f-wave) and spin configurations (singlet/triplet), realized via antisymmetrized geminal power (AGP).
3. Wigner Crystals: States exhibiting spontaneous translational symmetry breaking driven by strong Coulomb interactions, constructed using localized orbitals (Gaussian, squeezed Gaussian, and moiré potentials).
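
As a concrete instance of the first class, the Laughlin state at filling $1/m$ has the standard closed form $\Psi(z_1, \dots, z_N) = \prod_{i<j} (z_i - z_j)^m \exp(-\sum_i |z_i|^2 / 4)$ in the symmetric gauge, with $z = x + iy$ measured in units of the magnetic length. The sketch below evaluates it in log space. The function names, the disk geometry, and the unit convention are assumptions made for illustration and are not taken from the WF-Bench code.

```python
import numpy as np

def laughlin_log_psi(z, m=3):
    """Return (log|Psi|, phase) of the Laughlin state at filling 1/m.

    z: complex array of shape (N,), electron positions x + i*y in units
       of the magnetic length (an assumed convention).
    """
    z = np.asarray(z, dtype=complex)
    i, j = np.triu_indices(len(z), k=1)            # all pairs with i < j
    diff = z[i] - z[j]
    # Jastrow factor prod (z_i - z_j)^m times the Gaussian envelope.
    log_abs = m * np.sum(np.log(np.abs(diff))) - np.sum(np.abs(z) ** 2) / 4.0
    phase = m * np.sum(np.angle(diff))
    return log_abs, phase

def laughlin_psi(z, m=3):
    """Complex amplitude, rebuilt from the log-magnitude and phase."""
    log_abs, phase = laughlin_log_psi(z, m)
    return np.exp(log_abs + 1j * phase)

# Hypothetical positions, chosen only to exercise the function.
z = np.array([0.3 + 0.1j, -0.7 + 0.4j, 0.2 - 0.9j, 1.1 + 0.5j])
psi = laughlin_psi(z)

# Exchanging two electrons flips the sign for odd m (fermionic statistics).
z_swapped = z.copy()
z_swapped[[0, 1]] = z_swapped[[1, 0]]
psi_swapped = laughlin_psi(z_swapped)
```

Working with the log-magnitude and phase separately keeps the product over pairs from overflowing or underflowing as $N$ grows, and it makes the phase structure that the text calls "complex" explicit.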
Benchmarking Protocol: The authors propose a uniform training and evaluation framework based on fidelity optimization.
- Loss Function: The primary metric is wavefunction fidelity ($F$), optimized via the loss $L_F = -\log F$ with $F = \frac{|\langle \Psi_\theta | \Phi \rangle|^2}{\langle \Psi_\theta | \Psi_\theta \rangle \langle \Phi | \Phi \rangle}$, where $\Psi_\theta$ is the network wavefunction and $\Phi$ is the target.
- Optimization Challenges: Direct fidelity optimization suffers from vanishing signals and high variance in large systems due to interference. For topological states with complex phases, the authors employ a pretraining strategy using a hybrid loss ($L_{pre}$) that combines probability matching ($L_1$) and current matching ($L_2$). This mitigates "self-trapping," in which the network matches amplitudes on a small set of configurations without moving probability mass globally.
- Evaluation: The protocol systematically varies three key parameters: electron number ($N_e$), number of determinants ($N_{det}$), and network depth ($N_{layer}$).
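
The fidelity and its loss can be illustrated on a toy example in which both wavefunctions are finite vectors of complex amplitudes. This is only a sketch: for continuous electron coordinates the inner products are integrals that would typically be estimated by sampling, and that estimator is not shown here. The function names are hypothetical.

```python
import numpy as np

def fidelity(psi, phi):
    """F = |<psi|phi>|^2 / (<psi|psi> <phi|phi>) for amplitude vectors."""
    psi = np.asarray(psi, dtype=complex)
    phi = np.asarray(phi, dtype=complex)
    overlap = np.vdot(psi, phi)                    # vdot conjugates its first argument
    norm = np.vdot(psi, psi).real * np.vdot(phi, phi).real
    return float(np.abs(overlap) ** 2 / norm)

def fidelity_loss(psi, phi):
    """L_F = -log F; zero when the states agree up to scale and global phase."""
    return -np.log(fidelity(psi, phi))

# Toy states, not taken from the benchmark.
target = np.array([1.0, 1.0])
trial = np.array([1.0, 0.0])
f_half = fidelity(trial, target)                   # overlap 1, norms 1 and 2

# Rescaling and a global phase leave the fidelity unchanged.
f_same = fidelity(2.0 * np.exp(0.7j) * target, target)
```

Because $F$ is normalized by both norms, neither wavefunction needs to be normalized beforehand, which matters for network ansätze whose normalization is generally unknown.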
Architectures Tested: The protocol is applied to two widely used architectures: Ferminet (utilizing streaming permutation-equivariant one- and two-body features) and Psiformer (leveraging self-attention mechanisms).
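
Both architectures rely on features that are permutation-equivariant: reordering the electrons reorders the outputs in the same way, which is what lets a determinant built on top of them stay antisymmetric. The sketch below checks that property for a single-head self-attention layer. It is a generic illustration with random weights and hypothetical names, not the Psiformer implementation.

```python
import numpy as np

def self_attention(h, w_q, w_k, w_v):
    """Single-head self-attention over electron features h of shape (N, d)."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_electrons, d = 5, 8
h = rng.normal(size=(n_electrons, d))              # stand-in per-electron features
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n_electrons)
out = self_attention(h, w_q, w_k, w_v)
out_perm = self_attention(h[perm], w_q, w_k, w_v)

# Equivariance: permuting the inputs permutes the outputs identically.
is_equivariant = np.allclose(out_perm, out[perm])
```

Each output row mixes information from every electron, which is the sense in which attention looks at the whole group at once.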

Key Results
By applying WF-Bench to Ferminet and Psiformer, the authors derive empirical scaling laws for the maximum achievable fidelity ($F$):

System Size Scaling ($N_e$):
- Fidelity decay follows a power law: $F \approx 1 - \alpha(N_e - 2)^\beta$, so the infidelity $1 - F$ grows as a power of $N_e - 2$.
- The exponent $\beta$ reflects the correlation strength and phase complexity. Topological states exhibit the fastest decay (high $\beta$), followed by superconductors, while Wigner crystals show the slowest decay because strong electron localization suppresses complex phase winding.
- Architectural Comparison: At comparable parameter counts, Psiformer consistently achieves higher fidelity than Ferminet across all target wavefunctions. For example, at $N_e=10$ for topological states, Psiformer ($8.3 \times 10^5$ parameters) outperforms Ferminet ($7.3 \times 10^5$ parameters).
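
One way to extract $\alpha$ and $\beta$ from measured fidelities is a straight-line fit in log-log space, since $\log(1 - F) = \log \alpha + \beta \log(N_e - 2)$. The sketch below assumes this fitting procedure (the source does not say how the authors fit the law) and uses synthetic data generated from made-up values of $\alpha$ and $\beta$, not results from the paper.

```python
import numpy as np

def fit_power_law(n_e, fidelity):
    """Fit F = 1 - alpha * (N_e - 2)**beta by linear regression in log-log space."""
    n_e = np.asarray(n_e, dtype=float)
    infidelity = 1.0 - np.asarray(fidelity, dtype=float)
    mask = (n_e > 2) & (infidelity > 0)            # the logarithm needs positive arguments
    x = np.log(n_e[mask] - 2.0)
    y = np.log(infidelity[mask])
    beta, log_alpha = np.polyfit(x, y, 1)          # slope first, then intercept
    return float(np.exp(log_alpha)), float(beta)

# Synthetic data only: alpha = 0.002 and beta = 1.5 are invented for the demo.
n_e = np.array([4, 6, 8, 10, 12])
f = 1.0 - 0.002 * (n_e - 2.0) ** 1.5
alpha, beta = fit_power_law(n_e, f)
```

Points with $N_e = 2$ are dropped because the formula gives $F = 1$ there and the logarithm of zero infidelity is undefined. On noisy measurements, a least-squares fit in log space weights small infidelities more heavily than a direct fit would.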
Model Capacity Scaling ($N_{det}$ and $N_{layer}$):
- Determinants ($N_{det}$): Fidelity shows clear diminishing returns. Rapid improvements are observed for small $N_{det}$, but performance saturates beyond $N_{det} \approx 32$.
- Depth ($N_{layer}$): Increasing depth from 1 to 2 layers yields marked fidelity improvements, particularly for complex states like Moore-Read. However, further increases beyond $N_{layer}=2$ provide only modest gains, suggesting that deeper architectures do not substantially enhance representation power for these tasks.
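
To make the role of $N_{det}$ concrete, the sketch below builds a wavefunction amplitude as a sum of $N_{det}$ determinants and checks that it stays antisymmetric under electron exchange. Random matrices stand in for the network-generated orbitals, and the unweighted sum and all names are assumptions for illustration rather than details from the source.

```python
import numpy as np

def multi_det_psi(orbitals):
    """Psi = sum_k det(orbitals[k]) for orbitals of shape (n_det, N, N).

    orbitals[k, i, j] is orbital i of determinant k evaluated for electron j.
    """
    return np.linalg.det(orbitals).sum()           # det broadcasts over the leading axis

rng = np.random.default_rng(0)
n_det, n_electrons = 4, 3
orbitals = rng.normal(size=(n_det, n_electrons, n_electrons))
psi = multi_det_psi(orbitals)

# Exchanging electrons 0 and 1 swaps two columns in every determinant,
# so each term, and therefore the sum, changes sign.
swapped = orbitals[:, :, [1, 0, 2]]
psi_swapped = multi_det_psi(swapped)
```

Each added determinant contributes another antisymmetric term, so raising $N_{det}$ enlarges the family of representable states without breaking fermionic symmetry.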
Representational Difficulty: The difficulty of representing a state is jointly determined by the prefactor $\alpha$ (baseline error) and the exponent $\beta$. For instance, chiral triplet superconductors and Moore-Read states present significant challenges due to complex amplitudes and phase structures.

Significance and Claims
The paper claims that WF-Bench establishes a unified, dataset-driven framework for evaluating and comparing neural network wavefunctions. Its primary contributions are:

Standardization: It provides a reproducible protocol for fair comparison across different architectures and physical regimes, moving beyond ad-hoc evaluations.
Empirical Laws: It identifies specific scaling laws governing NN wavefunction representability, linking scaling exponents to physical properties like correlation strength and phase complexity.
Guidance for Design: The findings on diminishing returns for $N_{det}$ and $N_{layer}$ offer practical guidance for designing future architectures, suggesting that adding determinants or layers beyond certain thresholds may be a less efficient use of compute than other architectural innovations.

The authors position WF-Bench as a community resource intended to guide the design of future architectures and facilitate theoretical analysis of expressivity scaling. They note that the current optimization protocols, while effective, leave room for improvement, and that better optimization could refine the observed scaling behaviors.

WF-Bench: A Benchmark for Neural Network WaveFunction Expressivity and Scaling Laws