Faster Positional-Population Counts for AVX2, AVX-512, and ASIMD

Imagine you are running a massive, high-speed sorting facility for a library. But instead of sorting books by title or author, you are sorting bits (the tiny 0s and 1s that make up all digital data).

In this library, every "book" is a long strip of paper with 32, 64, or even 512 tiny switches on it. Some switches are flipped ON (1), and some are OFF (0).

The Problem: The "Positional" Count

Usually, if you want to know how many switches are ON in a whole room, you just count them all up. That's easy.

But in this paper, the researchers are solving a much trickier puzzle called Positional Population Count.

Imagine you have 1,000 strips of paper. On every strip, there are 64 switches.

You don't just want the total number of ON switches.
You want to know: How many times was the first switch ON across all strips? How many times was the second switch ON? The third? All the way to the 64th?

It's like surveying everyone in a stadium with the same 64 yes/no questions and wanting to know, for each question separately, how many people said yes. You need a separate counter for every single position.

The Old Way: The Slow Tally Clerk

Previously, computers did this like a very slow, meticulous clerk. They would pick up one strip, look at the first switch, update a counter. Then look at the second switch, update that counter. Then move to the next strip.

This was slow because the clerk had to handle every single switch one by one: a pile of 1,000 strips with 64 switches each means 64,000 separate look-and-tally steps. With a huge pile of data, the computer spends nearly all its time on that bookkeeping.
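
The clerk's routine fits in a few lines. The following is a minimal Python sketch of that one-switch-at-a-time approach, not code from the paper; the function name `pospopcnt_scalar` and the tiny 4-switch example are made up for illustration.

```python
def pospopcnt_scalar(words, w):
    """Count, for each bit position 0..w-1, how many words have that bit set."""
    counts = [0] * w
    for word in words:              # pick up one strip
        for pos in range(w):        # look at each switch in turn
            counts[pos] += (word >> pos) & 1
    return counts

# Three strips with 4 switches each (position 0 is the rightmost bit).
strips = [0b0101, 0b0011, 0b0001]
print(pospopcnt_scalar(strips, 4))  # [3, 1, 1, 0]
```

The inner loop runs once per switch, which is exactly the cost the rest of the article is trying to avoid.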

The New Way: The Super-Team (SIMD)

The authors of this paper (Robert, Daniel, and Florian) built a Super-Team to do the job. They use a technology called SIMD (Single Instruction, Multiple Data).

Think of SIMD as a giant robotic arm that can grab a whole stack of 64 strips of paper at once. Instead of looking at one switch at a time, the robot looks at the first switch of all 64 strips simultaneously, then the second switch of all 64 strips simultaneously, and so on.

However, just grabbing the paper isn't enough. You still have to count the switches. If the robot then tallies them one by one, the advantage of grabbing many strips at once is lost.

The Secret Sauce: The "Carry-Save" Magic Trick

The workhorse of the paper is a clever trick borrowed from circuit design called a Carry-Save Adder (CSA).

Imagine you have a pile of coins.

Normal Addition: You count them one by one. "One, two, three..." This takes a long time.
The Magic Trick: Instead of counting, you look at three coin slots of the same value at a time. Together they hold between 0 and 3 coins, and any such amount can be rewritten as at most one coin in the next pile up (worth double) plus at most one coin left over; for example, 3 = 2 + 1. Three piles become two, and this happens for every position in the room at once.

The researchers use this trick to compress the data. They take 15 or 16 giant stacks of paper and squeeze them down into just four or five "summary" stacks using only a handful of simple logic operations per stack. The expensive step of updating the real counters then happens far less often.
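
In code, one carry-save step is just a few bitwise operations on whole words. The sketch below is a generic illustration of the technique in Python, not the paper's SIMD kernel; the name `csa` and the 4-bit example values are chosen here for clarity.

```python
def csa(a, b, c):
    """Carry-save adder on whole words.

    At every bit position independently: a + b + c == 2 * carry + total.
    """
    total = a ^ b ^ c                       # 1 where an odd number of inputs are 1
    carry = (a & b) | (a & c) | (b & c)     # 1 where at least two inputs are 1
    return carry, total

# Bit 3 has three coins, bit 2 has two, bit 1 has one, bit 0 has none.
a, b, c = 0b1110, 0b1100, 0b1000
carry, total = csa(a, b, c)
print(f"{carry:04b} {total:04b}")           # 1100 1010
```

No position ever waits for its neighbour, which is why the same five operations work unchanged on a 512-bit vector.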

Why This Paper is Special

The researchers didn't just make the robot faster; they fixed three big problems that previous robots had:

The "Short Stack" Problem:
- Old Robot: If you only had 5 strips of paper, the robot was too big and clumsy to work. It had to do a lot of setup work just to count 5 items, making it slower than the slow clerk.
- New Robot: They built a special "mini-mode." Even if you only have 2 bytes of data (a tiny amount), the robot is fast enough to beat the slow clerk. It works efficiently from the very first byte.
The "Misaligned" Problem:
- Old Robot: If the stack of paper didn't start exactly on a perfect line (like starting at the 3rd inch instead of the 0-inch mark), the robot would crash or have to stop and realign everything manually.
- New Robot: It has "smart gloves." It can grab a stack even if it's slightly crooked, ignore the empty space, and get to work immediately without stopping.
The "Tall Stack" Problem:
- Old Robot: When the robot finished counting, it had to write the results down one by one, which was slow.
- New Robot: They invented a way to "transpose" the data. Imagine taking a deck of cards, shuffling them so that all the "Aces" are in one pile, all the "Kings" in another, and then counting those piles all at once. This allows the computer to dump the final results into memory incredibly fast.

The Results: Speeding Up the World

The researchers tested their new algorithm on modern computer chips (Intel and ARM).

The Result: Their method is so fast that it hits the "speed limit" of the computer's memory. This means the computer is working as fast as it can possibly read the data. It can't get any faster unless the wires in the computer are upgraded.
The Impact: This is huge for things like DNA sequencing (finding patterns in your genes), databases (finding how many people live in a specific city), and cryptography.

In a Nutshell

The paper presents a new, super-efficient way to count bits in a specific order.

Old way: Counting switches one by one.
New way: Using a giant robotic arm (SIMD) combined with a magic math trick (Carry-Save Adders) to count thousands of switches simultaneously.
The Bonus: It works just as well on a tiny pile of data as it does on a mountain of data, making it useful for almost any computer task.

It's like upgrading from a person counting grains of sand with a spoon to a vacuum cleaner that sucks up the whole beach in one second.

1. Problem Statement

Positional Population Count (pospopcnt) is an operation that, given an array of $w$-bit words, counts how often each bit position (from 0 to $w-1$) is set across the entire array.
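
Written as a formula (a restatement of the definition above; the symbols $x_i$, $n$, and $c_j$ are introduced here for illustration), input words $x_0, \dots, x_{n-1}$ produce $w$ counters:

$$c_j = \sum_{i=0}^{n-1} \left( \left\lfloor x_i / 2^j \right\rfloor \bmod 2 \right), \qquad 0 \le j < w.$$

The total number of set bits in the array is then $\sum_{j=0}^{w-1} c_j$.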

Context: Unlike the standard population count (which sums all set bits in a word), pospopcnt treats the input as $w$ interleaved bitstreams. It is crucial for applications involving one-hot encoded categorical variables (common in bioinformatics, database group-by queries, and wavelet trees).
Challenge: Previous state-of-the-art algorithms (specifically by Klarqvist et al.) relied on SIMD (Single Instruction, Multiple Data) techniques but suffered from high startup costs. They required large input sizes (several kilobytes) to outperform scalar code because they struggled with unaligned data, very short arrays, and the overhead of accumulating intermediate results into final counters.

2. Methodology

The authors propose a refined algorithm based on the Harley-Seal scheme, optimized specifically for modern SIMD architectures (AVX2, AVX-512, and ARM ASIMD). The core methodology involves three main pillars:

A. Optimized Carry-Save Adder (CSA) Networks

The algorithm uses CSA networks to compress bit populations without immediate carry propagation, allowing for high instruction-level parallelism (ILP).

Initial Iteration (CSA15): Instead of initializing accumulators to zero and running the main loop immediately, the authors process the first 15 vectors of input using a dedicated CSA15 network. This reduces the initial 15 vectors into 4 accumulator vectors ($a_8, a_4, a_2, a_1$) using only 11 full-adder steps. Each full adder turns three vectors into two, so going from 15 vectors to 4 takes $15 - 4 = 11$ steps, which saves work compared to the standard 15-step approach.
Main Loop (CSA16+4): The main loop processes 16 vectors of input at a time, combining them with the existing 4 accumulators to produce five accumulator vectors ($a_{16}, a_8, a_4, a_2, a_1$). The top vector ($a_{16}$) is extracted and added to the final counters, while the remaining four are carried over to the next iteration.
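
The adder counts above can be checked with a small simulation. The following Python sketch models vectors as 64-bit integers and wires the full adders as a simple chain per weight. The paper's actual network layout is not given here and is likely arranged differently for instruction-level parallelism, so `compress` and `weighted_count` are hypothetical helpers that only reproduce the arithmetic.

```python
import random

def csa(a, b, c):
    """Full adder on whole words: per bit, a + b + c == 2*carry + total."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c   # (carry, total)

def compress(inputs, accs=()):
    """Reduce weight-1 vectors (plus carried-in accumulators a1, a2, a4, ...)
    to one vector per weight. Returns ([a1, a2, a4, ...], full_adder_count)."""
    level, outputs, adders, k = list(inputs), [], 0, 0
    while level:
        if k < len(accs):
            level.append(accs[k])          # accumulator of weight 2**k joins in
        assert len(level) % 2 == 1, "sketch only handles odd counts per weight"
        total, carries = level[0], []
        for i in range(1, len(level), 2):  # each adder absorbs two more vectors
            carry, total = csa(total, level[i], level[i + 1])
            carries.append(carry)
            adders += 1
        outputs.append(total)
        level, k = carries, k + 1          # carries have twice the weight
    return outputs, adders

def weighted_count(vectors, bit):
    """Value encoded at one bit position by vectors of weight 1, 2, 4, ..."""
    return sum(((v >> bit) & 1) << k for k, v in enumerate(vectors))

rng = random.Random(0)
first = [rng.getrandbits(64) for _ in range(15)]
accs, n15 = compress(first)                 # CSA15: 15 vectors -> a1, a2, a4, a8
block = [rng.getrandbits(64) for _ in range(16)]
out, n16 = compress(block, accs)            # CSA16+4: -> a1, a2, a4, a8, a16
print(n15, len(accs), n16, len(out))        # 11 4 15 5
```

At every bit position, `weighted_count(out, bit)` equals the number of input vectors seen so far with that bit set, so no information is lost while the counters stay packed in five vectors.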

B. Bit-Parallel Accumulation and Transposition

A major bottleneck in previous methods was the scalar-like accumulation of the CSA results into the final counters. The authors introduce a fully vectorized approach:

Transposition: The bits in the accumulator vectors (e.g., $a_{16}$) are transposed so that bits of the same significance are grouped together.
Folding/Reduction: Using a recursive matrix transposition strategy, the algorithm folds the data (splitting even/odd bits, shifting, and adding) to reduce the bit-width from the vector size down to 16-bit or 32-bit counters.
Efficiency: This process is implemented using specific SIMD shuffle and logical instructions (e.g., vpternlogd on AVX-512, bsl on ASIMD) to perform the reduction in $O(\log w)$ steps rather than linear steps.
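
The even/odd split-and-fold idea can be illustrated on a single 64-bit integer holding eight 8-bit words ($w = 8$). This is a scalar Python analogue written for this summary, not the paper's SIMD kernel: the function name `fold_counts`, the mask table, and the dictionary bookkeeping are all assumptions made for readability. It reaches the eight per-position counts in three levels, matching the $\log_2 w$ step count.

```python
MASKS = [0x5555555555555555, 0x3333333333333333, 0x0F0F0F0F0F0F0F0F]

def fold_counts(x):
    """Per-bit-position counts for the 8 bytes packed in a 64-bit integer."""
    words = {0: x}              # key: lowest bit position the word covers
    width = 64                  # current word width in bits
    for level, mask in enumerate(MASKS):
        step = 1 << level       # current field width in bits
        half = width // 2
        low = (1 << half) - 1
        nxt = {}
        for pos, v in words.items():
            even = v & mask             # fields with even index
            odd = (v >> step) & mask    # fields with odd index, shifted down
            # Fold the upper half onto the lower half; the doubled field
            # width leaves room for the sums, so nothing overflows.
            nxt[pos] = (even & low) + (even >> half)
            nxt[pos + step] = (odd & low) + (odd >> half)
        words, width = nxt, half
    return [words[p] for p in range(8)]

# Byte 0 is 0x01 (bit 0 set), byte 1 is 0x80 (bit 7 set).
print(fold_counts(0x8001))              # [1, 0, 0, 0, 0, 0, 0, 1]
```

Each level doubles the number of words and halves their width, so the work per level stays constant while the fields grow wide enough to hold the partial sums.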

C. Handling Edge Cases

The algorithm includes specialized handling for non-ideal inputs:

Head Processing: To handle unaligned memory addresses without faults, the algorithm loads an initial vector from an aligned address, clears the "prefix" bytes (data before the actual start), and uses this as the first input to the CSA network.
Short Arrays & Tail Processing: For inputs smaller than the main loop block size (or the remaining "tail" of a large array), a specialized bit-parallel scalar tail algorithm is used. This avoids the overhead of the main loop setup and ensures performance remains high even for inputs as small as 2 bytes.
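
The head-processing step can be simulated on a byte string. The Python sketch below only models the idea (round the start down to an aligned offset, load a full vector, zero the prefix); the 32-byte vector size, the names `load_head` and `count_bits`, and the use of offsets instead of real addresses are assumptions for illustration. Zeroed bytes contribute nothing to any counter, so the masked vector can enter the CSA network like any other input.

```python
VEC = 32  # assumed vector size in bytes (AVX2-sized); chosen for illustration

def load_head(memory, start):
    """Simulate an aligned load of the vector containing `start`, with the
    bytes before `start` (the prefix) cleared. Returns (next_offset, vector)."""
    base = start - start % VEC                   # round down to alignment
    vector = bytearray(memory[base:base + VEC])
    vector[:start - base] = bytes(start - base)  # zero the prefix bytes
    return base + VEC, bytes(vector)

def count_bits(data):
    """Scalar reference: per-bit-position counts for 8-bit words."""
    return [sum((byte >> pos) & 1 for byte in data) for pos in range(8)]

memory = bytes(range(1, 101))       # pretend address space; data starts at 37
nxt, head = load_head(memory, 37)
print(nxt, count_bits(head) == count_bits(memory[37:nxt]))  # 64 True
```

After the head, every subsequent load starts at an aligned offset (`nxt`), so the main loop never has to deal with misalignment.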

3. Key Contributions

Improved Algorithm Structure: A modified Harley-Seal scheme that separates the initial 15-vector reduction from the main loop, reducing startup latency.
Fully Vectorized Accumulation: Replacing scalar accumulation loops with bit-parallel transposition and reduction, significantly lowering the instruction count per byte.
Robust Edge Case Handling: Effective strategies for unaligned memory access and very short arrays, enabling "good performance from the first byte."
Cross-Architecture Implementation: The authors provide open-source, assembly-optimized implementations for:
- Intel AVX2 (256-bit vectors)
- Intel AVX-512 (512-bit vectors, using the AVX-512F and AVX-512BW extensions)
- ARM ASIMD (128-bit vectors for AArch64)
Variable Word Width Support: The algorithm is generic, allowing the user to select any word width $w$ (power of 2) at runtime by swapping the accumulation function.

4. Results

The authors benchmarked their implementation against the Klarqvist et al. algorithm and scalar baselines on Intel Xeon W-2133 (Skylake) and AWS Graviton 3 (Neoverse V1).

Throughput:
- AVX-512: Achieved a peak throughput of 91.0 GB/s on 512 KiB inputs. This is a 53% improvement over the best Klarqvist et al. kernel (59.4 GB/s).
- AVX2: Achieved 34.8 GB/s.
- ASIMD: Achieved 16 GB/s (approx. 83% of the theoretical roofline for that architecture).
Small Input Performance: Unlike previous methods, the new algorithm performs well on small arrays. It approaches memory-bound speeds at around 4 KiB without a long "warm-up" period, and it remains efficient for inputs as small as 2 bytes.
Instruction Efficiency: The AVX-512 implementation requires only 0.09 instructions per byte, compared to 0.13 for the previous best.
Memory Bound: The algorithm becomes memory-bound for arrays larger than a few kilobytes, meaning it saturates the available memory bandwidth.

5. Significance

Practical Impact: The ability to process short arrays efficiently makes this algorithm viable for real-world database queries and bioinformatics tasks where data chunks may be small or irregular.
Architectural Adaptability: By demonstrating high performance across Intel (AVX2/AVX-512) and ARM (ASIMD) architectures, the work provides a robust foundation for cross-platform high-performance computing libraries.
Algorithmic Advancement: The paper demonstrates that by optimizing the structure of the accumulation phase (transposition and reduction) rather than just the counting phase, significant gains can be made even on hardware with fixed vector widths.
Open Source: The code is made available, facilitating adoption in fields like metagenomic profiling (e.g., KMCP tool) and database engines.

In summary, this paper presents a highly optimized, portable, and low-latency solution for positional population counts, overcoming the limitations of prior SIMD approaches and setting a new benchmark for bit-parallel processing on modern CPUs.