Concurrent Deterministic Skiplist and Other Data Structures

This paper presents the design, analysis, and performance evaluation of a concurrent deterministic skiplist, alongside lock-free queue and hash table implementations, on many-core NUMA systems. It introduces memory management strategies and hierarchical usage patterns to reduce page faults, cache misses, and remote memory access latencies.

Aparna Sasidharan

Published 2026-03-06

Imagine you are the manager of a massive, super-fast library called the Many-Core Library. This library has thousands of tiny librarians (processors) working at the same time, but they are organized into different wings (NUMA nodes). If a librarian in the North Wing needs a book stored in the South Wing, it takes a long time to walk over there. The goal of this paper is to figure out how to organize the books so that every librarian can find what they need instantly, without tripping over each other or walking across the whole building.

The author, Aparna Sasidharan, tests three different ways to organize this library: Skiplists (a smart way to sort books), Queues (lines for waiting), and Hash Tables (a magic filing cabinet).

Here is a breakdown of her findings using simple analogies.

1. The Skiplist: The "Express Elevator" System

Most libraries use a standard list where you have to check every single book one by one to find the right one. That's slow. A Skiplist is like a building with express elevators.

  • How it works: Imagine you want to find a book on the 50th floor. Instead of walking up 50 flights of stairs, you take an elevator to the 40th floor, then a smaller elevator to the 48th, then a short walk to the 50th.
  • The Problem: The author looked at "Random" Skiplists (where the elevator stops are decided by coin flips) and "Deterministic" Skiplists (where the stops are planned perfectly).
  • The Twist: She built a Deterministic Skiplist (a perfectly planned elevator system) that works even when thousands of librarians are trying to use it at once.
  • The Result: While her perfectly planned system was great, she found that the "Random" version (the dice-rolling one) was actually faster when the library got too crowded. Why? Because the random system required less "re-arranging" of the shelves when new books arrived. It was more flexible.
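
To make the elevator picture concrete, here is a minimal single-threaded sketch of the coin-flipping (randomized) skiplist. The paper's structures are concurrent; all names and sizes below are illustrative, not taken from the paper:

```cpp
#include <climits>
#include <cstdlib>

// A node's "height" decides how many express-elevator levels it appears on.
constexpr int kMaxLevel = 4;

struct Node {
    int key;
    Node* next[kMaxLevel];  // next[i] = successor at level i
};

struct SkipList {
    Node head;  // sentinel smaller than every key

    SkipList() : head{INT_MIN, {}} {}

    // Randomized variant: flip a coin until tails to pick a height.
    static int random_height() {
        int h = 1;
        while (h < kMaxLevel && (std::rand() & 1)) ++h;
        return h;
    }

    void insert(int key) {
        Node* update[kMaxLevel];  // last node before `key` on each level
        Node* x = &head;
        for (int i = kMaxLevel - 1; i >= 0; --i) {
            while (x->next[i] && x->next[i]->key < key) x = x->next[i];
            update[i] = x;
        }
        int h = random_height();
        Node* n = new Node{key, {}};  // sketch: nodes are never freed
        for (int i = 0; i < h; ++i) {
            n->next[i] = update[i]->next[i];
            update[i]->next[i] = n;
        }
    }

    bool contains(int key) {
        Node* x = &head;
        // Start at the top "express" level and descend, skipping most nodes.
        for (int i = kMaxLevel - 1; i >= 0; --i)
            while (x->next[i] && x->next[i]->key < key) x = x->next[i];
        Node* cand = x->next[0];
        return cand && cand->key == key;
    }
};
```

A deterministic skiplist would replace random_height() with rules that keep levels perfectly balanced, which is exactly the extra "re-arranging" that slowed it down under contention.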

2. The Queue: The "Recycling Bin" Strategy

Queues are like lines of people waiting for a ticket. In computer land, these lines are often made of paper slips (memory blocks).

  • The Problem: If every time someone leaves the line, you throw their slip in the trash and print a brand new one for the next person, you waste a lot of time and paper. Also, if everyone runs to the central supply closet to get new paper, the closet gets jammed.
  • The Solution: The author created a Lock-Free Queue with a "Recycling Bin."
    • Instead of throwing slips away, she puts them in a bin.
    • When a new person joins the line, she grabs a used slip from the bin, erases it, and reuses it.
    • She also grouped the slips into "blocks" (like a stack of 8,000 slips) so the librarians don't have to walk to the supply closet as often.
  • The Result: This method was much faster and didn't clog up the library's memory, especially when many librarians were working at once.
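
One common way to build such a recycling bin is a lock-free free-list (a Treiber stack) that hands out nodes carved from larger blocks. The sketch below illustrates that general technique, not the paper's exact code; the block size is shrunk for readability, and a production version would also need ABA protection (e.g. tagged pointers or hazard pointers):

```cpp
#include <atomic>

struct QNode {
    int value;
    QNode* next;
};

class FreeList {
    std::atomic<QNode*> head_{nullptr};
    static constexpr int kBlock = 8;  // the paper reportedly uses thousands per block
public:
    // Push a retired node onto the bin (lock-free).
    void recycle(QNode* n) {
        QNode* h = head_.load(std::memory_order_relaxed);
        do { n->next = h; }
        while (!head_.compare_exchange_weak(h, n, std::memory_order_release,
                                            std::memory_order_relaxed));
    }

    // Pop a recycled node, or carve a fresh block if the bin is empty.
    // Because nodes are pooled and never freed, h->next stays readable
    // even if another thread pops h first (classic free-list argument).
    QNode* acquire() {
        QNode* h = head_.load(std::memory_order_acquire);
        while (h && !head_.compare_exchange_weak(h, h->next,
                                                 std::memory_order_acquire)) {}
        if (h) return h;
        // Bin empty: one trip to the "supply closet" yields a whole block.
        QNode* block = new QNode[kBlock]();  // sketch: blocks live forever
        for (int i = 1; i < kBlock; ++i) recycle(&block[i]);
        return &block[0];
    }
};
```

The key property is that a steady-state workload stops allocating entirely: every dequeue feeds the bin that the next enqueue draws from.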

3. The Hash Table: The "Magic Filing Cabinet"

A Hash Table is like a filing cabinet where you don't look for a name alphabetically; you use a magic formula to know exactly which drawer to open.

  • The Problem: When the cabinet gets full, you have to move everything to a bigger cabinet (resizing). This is like moving an entire library to a new building while people are still trying to find books. It causes chaos, and the "magic formula" often points to drawers that are far apart in the building, causing librarians to run back and forth (cache misses).
  • The Solution: She tested two types of cabinets:
    1. The Flat Cabinet: One giant drawer system.
    2. The Two-Level Cabinet: A main cabinet with small sub-cabinets inside.
  • The Result: The Two-Level Cabinet won. It was like having a main directory that told you exactly which small sub-cabinet to go to. This kept the librarians in one small area of the building, reducing the time they spent running across the library. It beat the standard Intel library implementation in many tests.
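
A two-level table can be sketched as a directory of small fixed-size sub-tables, with all probing confined to one sub-table so a lookup touches one compact region of memory. This is a simplified single-threaded illustration with made-up sizes, not the paper's implementation:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

class TwoLevelMap {
    struct Slot { bool used = false; int key = 0; int val = 0; };
    static constexpr std::size_t kSubTables = 16;  // level-1 directory size
    static constexpr std::size_t kSubSize   = 64;  // slots per sub-table
    std::vector<std::vector<Slot>> tables_;
public:
    TwoLevelMap() : tables_(kSubTables, std::vector<Slot>(kSubSize)) {}

    bool insert(int key, int val) {
        auto h = std::hash<int>{}(key);
        auto& t = tables_[h % kSubTables];            // level 1: pick sub-table
        for (std::size_t i = 0; i < kSubSize; ++i) {  // level 2: probe locally
            Slot& s = t[(h / kSubTables + i) % kSubSize];
            if (!s.used || s.key == key) { s = {true, key, val}; return true; }
        }
        return false;  // sub-table full; a real version would grow just this one
    }

    std::optional<int> find(int key) const {
        auto h = std::hash<int>{}(key);
        const auto& t = tables_[h % kSubTables];
        for (std::size_t i = 0; i < kSubSize; ++i) {
            const Slot& s = t[(h / kSubTables + i) % kSubSize];
            if (!s.used) return std::nullopt;  // empty slot ends the probe
            if (s.key == key) return s.val;
        }
        return std::nullopt;
    }
};
```

Note the resizing benefit: when one sub-table fills up, only that sub-table needs to grow, instead of rebuilding the whole cabinet while everyone waits.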

4. The Secret Sauce: Managing the "Wings" (NUMA)

The biggest lesson from the paper isn't just about the data structures, but where they live.

  • The Analogy: Imagine the library has 8 wings. If a librarian in Wing 1 has to constantly run to Wing 8 to grab a book, the whole system slows down.
  • The Strategy: The author's system tries to keep the books and the librarians in the same wing.
    • She used "Huge Pages" (oversized memory pages, so the catalog has far fewer entries to flip through, cutting page faults and address-translation misses).
    • She used "Recycling" so librarians don't have to walk to the supply closet.
    • She split the work so that librarians in Wing 1 mostly talk to books in Wing 1.
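
The "same wing" idea can be approximated in plain C++ with per-thread arenas: under Linux's default first-touch policy, a page lands on the NUMA node of the thread that first writes it, so thread-local allocation keeps data local. Explicit placement (e.g. libnuma's numa_alloc_onnode) and huge pages are omitted here; this is a minimal sketch, not the paper's allocator:

```cpp
#include <cstddef>
#include <vector>

// Bump allocator over thread-owned blocks. Because this thread performs the
// first touch on every block, the OS places the pages on its local node.
class Arena {
    std::vector<std::vector<char>> blocks_;
    std::size_t off_ = 0;
    static constexpr std::size_t kBlock = 1 << 16;  // 64 KiB chunks
public:
    void* alloc(std::size_t n) {
        if (blocks_.empty() || off_ + n > kBlock) {
            blocks_.emplace_back(kBlock);  // zero-fill = first touch, local node
            off_ = 0;
        }
        void* p = blocks_.back().data() + off_;
        off_ += n;
        return p;
    }
};

// One arena per thread: a librarian's allocations never leave its wing.
inline Arena& local_arena() {
    thread_local Arena a;
    return a;
}
```

Combining this with the recycling bins above means a node freed in Wing 1 is reused in Wing 1, so remote-memory traffic stays low even after long runs.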

The Big Takeaway

In a world with super-fast, multi-core computers, the old ways of organizing data often fail because they cause too much "traffic" and "walking distance" between different parts of the machine.

The paper shows that:

  1. Recycling memory (reusing old blocks) is better than constantly buying new ones.
  2. Grouping data (using two-level tables or blocks) keeps librarians in their local neighborhood, saving time.
  3. Sometimes, a randomized system (like the random skiplist) is actually more efficient than a perfectly planned one because it requires less maintenance when things get busy.

By using these strategies, the author made data structures that can handle massive amounts of work without the computer getting tired or confused.