ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs

Imagine you have a massive library of knowledge (a Large Language Model, or LLM) that you want to read and answer questions from. Usually, people use super-fast, expensive graphics cards (GPUs) to do this. But what if you only have a giant, powerful computer server with hundreds of regular processor cores (CPUs)?

This is where the ARCLIGHT paper comes in. It describes a new way of running these AI models on those big CPU servers, and it solves a very specific, boring, but deadly problem: the "Cross-Node" Traffic Jam.

Here is the story of ARCLIGHT, explained simply.

1. The Problem: The "Office Building" Traffic Jam

Imagine a huge office building with 128 employees (the CPU cores) and 4 separate floors (the NUMA nodes). Each floor has its own supply closet (local memory).

The Old Way (llama.cpp): The manager tells all 128 employees to work together on one task. But here's the catch: The manager doesn't care which floor the supply closet is on. If an employee on Floor 1 needs a stapler from the Floor 4 closet, they have to take the elevator, walk across the building, and wait in line.
The Result: Even though you have 128 employees, they spend 75% of their time waiting for elevators and walking. They are "stalled" by the distance between the worker and the data. This is called the Cross-NUMA Memory Wall.

Existing software (like llama.cpp) was built for smaller, simpler computers. When you try to run it on these giant 128-core servers, it gets bogged down because it doesn't know how to keep workers on their own floors.

2. The Solution: ARCLIGHT (The Smart Manager)

The authors built ARCLIGHT from scratch. Think of it not as a heavy, bloated software suite, but as a lightweight, minimalist toolkit designed specifically for these giant office buildings.

It fixes the traffic jam with three main tricks:

A. The "Local Supply Closet" Strategy (Memory Management)

Instead of letting the operating system randomly place supplies, ARCLIGHT says: "If you are on Floor 1, your supplies stay on Floor 1."

It creates separate supply closets for each floor.
It ensures that when an employee needs data, they grab it from the closet right next to them. No elevators, no walking.

B. The "Split Team" Strategy (Tensor Parallelism)

In the old way, everyone tried to do the exact same math at the same time, which caused confusion about who needed what.
ARCLIGHT changes the game:

It splits the big math problem into four smaller pieces.
Floor 1 does the first piece using Floor 1's data.
Floor 2 does the second piece using Floor 2's data.
They work completely independently. They don't need to talk to each other until the very end.
Analogy: Instead of 128 people trying to build one giant Lego castle together (bumping into each other), you give 32 people on Floor 1 a small section of the castle to build, and 32 people on Floor 2 build their own section. They only meet up at the end to snap the pieces together.

C. The "Flexible Assembly Line" (Thread Scheduling)

Sometimes, the workers on different floors finish their small tasks at different speeds.

Old Way: Everyone waits for the slowest person before moving to the next step. (The "Global Barrier").
ARCLIGHT Way: It uses a smarter system. If Floor 1 finishes early, they can start the next small task immediately, while Floor 2 is still working. They only stop to sync up when absolutely necessary. This keeps everyone busy and moving fast.

3. The Results: Speeding Up the Office

The authors tested this on a massive server with 192 cores (4 floors of 48 cores each).

The Competition: The popular tool llama.cpp was slow because of the elevator traffic (cross-node memory access).
ARCLIGHT: By keeping data local and splitting the work smartly, it delivered up to 46% higher throughput.

Why Does This Matter?

Most people think AI needs expensive, rare GPUs. But many companies already have massive CPU servers (used for web hosting and networking).

Before: These servers were too slow to run big AI models efficiently.
Now: With ARCLIGHT, companies can use the hardware they already own to run powerful AI, saving money and making AI more accessible.

In a Nutshell

ARCLIGHT is a clever, lightweight software manager that stops big computer servers from wasting time walking across the building to get data. It organizes the workers so they stay in their own neighborhoods, splits the work up so they don't bump into each other, and lets them work at their own pace. The result? A much faster, cheaper way to run AI on standard computer chips.

Here is a detailed technical summary of the paper "ARCLIGHT: A LIGHTweight LLM Inference ARChitecture for Many-Core CPUs".

1. Problem Statement

While Large Language Model (LLM) inference frameworks are mature for GPUs and single-node CPUs, they fail to fully exploit many-core CPU platforms (e.g., web servers, high-end networking devices), which often use Non-Uniform Memory Access (NUMA) architectures.

The NUMA Bottleneck: In NUMA systems, CPU cores and memory are partitioned into nodes. Accessing local memory is fast, but accessing remote (cross-node) memory is significantly slower (up to 4x, as measured in the paper).
Limitations of Existing Frameworks: Mainstream frameworks like llama.cpp treat memory as a uniform block (UMA). They often distribute threads across NUMA nodes but fail to bind data (weights/activations) to the local memory of those nodes. This leads to frequent cross-NUMA memory access, creating a "memory access wall" that prevents the system from reaching its theoretical computational ceiling.
Refactoring Difficulty: Retrofitting existing, bloated frameworks to handle NUMA-awareness requires "surgical" refactoring of the entire stack, from low-level memory allocation to high-level model definitions, which is complex and opaque.
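A back-of-the-envelope model shows why unbound data is so costly. The sketch below is illustrative only: the function name is hypothetical, the 4x remote penalty is the upper bound quoted above, and the assumption that pages are spread uniformly across nodes is ours, not a measurement from the paper.

```python
from typing import Optional


def mean_access_cost(num_nodes: int,
                     remote_penalty: float,
                     local_fraction: Optional[float] = None) -> float:
    """Average cost of one memory access, in units of a local access.

    If local_fraction is not given, assume pages are spread uniformly,
    so only 1/num_nodes of accesses happen to hit local memory.
    """
    if local_fraction is None:
        local_fraction = 1.0 / num_nodes
    remote_fraction = 1.0 - local_fraction
    return local_fraction * 1.0 + remote_fraction * remote_penalty


# Uniformly spread pages on 4 nodes: 3 out of 4 accesses are remote.
unbound = mean_access_cost(num_nodes=4, remote_penalty=4.0)

# Data bound to the node that uses it: every access is local.
bound = mean_access_cost(num_nodes=4, remote_penalty=4.0, local_fraction=1.0)
```

Under these assumptions the unbound case costs 0.25 * 1 + 0.75 * 4 = 3.25 local-access units per access, against 1.0 when all data is node-local. Real systems overlap and cache memory traffic, so this is an intuition aid rather than a throughput prediction.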

2. Methodology: ARCLIGHT Architecture

The authors propose ARCLIGHT, a lightweight, modular inference architecture built from the ground up specifically for many-core CPUs. It consists of approximately 10 C++ header/source files and follows a decoupled design with a high-level decoding frontend and a low-level inference backend.

Key Design Components:

Modular Design: The system is divided into five core modules: Memory Manager, Thread Manager, Tensor Library, Forward Graph Builder, and Graph Computation Scheduler.
NUMA-Aware Memory Management:
- Unlike llama.cpp's monolithic buffer, ARCLIGHT pre-allocates separate memory pools in the local memory of each NUMA node.
- It implements a double-buffering mechanism for activation buffers, alternating based on layer parity to reduce runtime memory consumption.
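The two memory ideas above can be sketched in a few lines. This is a minimal Python model, not ARCLIGHT's C++ code: the class and method names are hypothetical, and a plain bytearray stands in for memory that a real implementation would have to allocate on a specific node (for instance with a node-local allocator such as libnuma's numa_alloc_onnode; whether ARCLIGHT uses that call is not stated here).

```python
class NodePool:
    """Stand-in for a pre-allocated memory pool owned by one NUMA node."""

    def __init__(self, node_id: int, capacity: int):
        self.node_id = node_id
        self.buffer = bytearray(capacity)  # real code: allocate on node_id
        self.offset = 0

    def alloc(self, nbytes: int) -> memoryview:
        """Bump-allocate nbytes from this node's pool."""
        if self.offset + nbytes > len(self.buffer):
            raise MemoryError(f"pool for node {self.node_id} is exhausted")
        view = memoryview(self.buffer)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view


class ActivationBuffers:
    """Two activation buffers that swap roles based on layer parity."""

    def __init__(self, pool: NodePool, nbytes: int):
        self.buffers = (pool.alloc(nbytes), pool.alloc(nbytes))

    def output_for(self, layer: int) -> memoryview:
        # Even layers write buffer 0, odd layers write buffer 1.
        return self.buffers[layer % 2]

    def input_for(self, layer: int) -> memoryview:
        # A layer reads what the previous layer wrote.
        return self.buffers[(layer + 1) % 2]


# One pool and one pair of activation buffers per node (4 nodes assumed).
pools = [NodePool(node_id, capacity=1024) for node_id in range(4)]
activations = [ActivationBuffers(pool, nbytes=256) for pool in pools]
```

Because layer i reads the buffer that layer i-1 wrote and overwrites the one layer i-1 read, two buffers per node are enough regardless of model depth.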
Multi-View Thread Management:
- Instead of a single thread pool, ARCLIGHT introduces thread groups.
- It supports dynamic reconfiguration, allowing the pool to be split into $n$ groups to execute $n$ independent tensor operations in parallel.
- It distinguishes between local barriers (synchronization within a group) and global barriers (synchronization across the entire pool), enabling flexible synchronization strategies.
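The distinction between local and global barriers can be illustrated with Python's standard threading.Barrier. This is a sketch under stated assumptions: run_groups, the per-group step counts, and the log are hypothetical names introduced for illustration, and Python threads stand in for the pinned worker threads a C++ implementation would use.

```python
import threading


def run_groups(num_groups, threads_per_group, steps_per_group):
    """Each group syncs internally after every step; all groups sync once at the end."""
    total_threads = num_groups * threads_per_group
    global_barrier = threading.Barrier(total_threads)
    local_barriers = [threading.Barrier(threads_per_group)
                      for _ in range(num_groups)]
    log = []
    log_lock = threading.Lock()

    def worker(group, rank):
        for step in range(steps_per_group[group]):
            # ... this thread's slice of the group's tensor operation ...
            local_barriers[group].wait()   # waits only for its own group
            if rank == 0:
                with log_lock:
                    log.append((group, step))
        global_barrier.wait()              # the whole pool rejoins here
        if group == 0 and rank == 0:
            with log_lock:
                log.append("joined")

    threads = [threading.Thread(target=worker, args=(g, r))
               for g in range(num_groups)
               for r in range(threads_per_group)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


# Two groups of two threads; group 1 runs more steps than group 0.
events = run_groups(num_groups=2, threads_per_group=2, steps_per_group=[1, 3])
```

Group 0 finishes its single step without waiting for group 1's three steps, and the "joined" entry is always logged last because every thread must pass the global barrier first.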
Static Computation Graph: The graph is constructed before execution. The system simplifies topological sorting by appending nodes to a sequential container during construction, avoiding re-analysis costs.
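The reason appending during construction is enough can be shown with a toy graph. The Graph class below is a hypothetical illustration, not ARCLIGHT's API: because a node can only name inputs that already exist, the order of the list is automatically a valid topological order and no separate sort is needed.

```python
class Graph:
    """Toy static graph whose construction order is its execution order."""

    def __init__(self):
        self.nodes = []  # sequential container, filled during construction

    def add(self, name, fn, *inputs):
        """Append a node; inputs are indices of nodes added earlier."""
        for index in inputs:
            if not 0 <= index < len(self.nodes):
                raise ValueError(f"{name}: input {index} is not defined yet")
        self.nodes.append((name, fn, inputs))
        return len(self.nodes) - 1

    def run(self):
        """Execute nodes front to back; every input is ready when needed."""
        values = []
        for _name, fn, inputs in self.nodes:
            values.append(fn(*(values[i] for i in inputs)))
        return values


graph = Graph()
x = graph.add("x", lambda: 3)
w = graph.add("w", lambda: 4)
y = graph.add("mul", lambda a, b: a * b, x, w)
z = graph.add("add", lambda a, b: a + b, y, x)
result = graph.run()[z]
```

Since the graph is fixed before execution, the same node list can be replayed for every generated token without re-analysis.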

3. Core Innovation: Cross-NUMA Tensor Parallelism (TP)

To mitigate the cross-node memory access wall, ARCLIGHT introduces Cross-NUMA Tensor Parallelism.

Weight Partitioning: Weights are partitioned across NUMA nodes. For example, within a Transformer layer, the attention projections $W_q, W_k, W_v$ are row-partitioned (by attention heads), while the output projection $W_o$ and the MLP down-projection $W_{down}$ are column-partitioned.
Scatter and Gather Operators:
- Scatter: Reconfigures the thread pool into groups and creates "view tensors" for inputs, effectively splitting the computation graph into parallel subgraphs. Each subgraph runs on a specific NUMA node using local data.
- Gather: Collects and sums the output tensors from all subgraphs, merging the results and restoring the thread pool to a single group.
Asynchronous Execution: The system supports two synchronization modes: a synchronous mode, in which all groups meet at a global barrier after every step, and an asynchronous mode, in which each group advances through its own subgraph and joins the others only at global barriers. Empirical results show that asynchronous subgraph execution significantly reduces thread idle time compared to strict global synchronization.
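A small numeric sketch shows why a row/column split followed by a summing Gather reproduces the unpartitioned result. The assumptions here are ours: weights are stored as [output, input] matrices, the up-projection W_up (not named above) is row-partitioned to pair with the column-partitioned W_down, ReLU stands in for the model's actual activation, and the shards run sequentially rather than on separate NUMA nodes. All function names are hypothetical.

```python
def matvec(W, x):
    """y = W x, with W stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def mlp(W_up, W_down, x):
    """Reference MLP: down-projection of the activated up-projection."""
    return matvec(W_down, relu(matvec(W_up, x)))


def scatter(W_up, W_down, num_nodes):
    """Row-partition W_up and column-partition W_down along the hidden dim."""
    hidden = len(W_up)
    if hidden % num_nodes != 0:
        raise ValueError("hidden size must divide evenly across nodes")
    step = hidden // num_nodes
    shards = []
    for node in range(num_nodes):
        lo, hi = node * step, (node + 1) * step
        up_rows = W_up[lo:hi]                        # this node's hidden units
        down_cols = [row[lo:hi] for row in W_down]   # matching input columns
        shards.append((up_rows, down_cols))
    return shards


def gather(partials):
    """Element-wise sum of the per-node partial outputs."""
    return [sum(values) for values in zip(*partials)]


def mlp_tensor_parallel(W_up, W_down, x, num_nodes):
    # In a real system each shard would run on its own node with local data.
    partials = [mlp(up, down, x) for up, down in scatter(W_up, W_down, num_nodes)]
    return gather(partials)


W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]   # hidden=4, input=2
W_down = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 2.0, 0.0]]     # output=2, hidden=4
x = [2.0, 3.0]

assert mlp_tensor_parallel(W_up, W_down, x, num_nodes=2) == mlp(W_up, W_down, x)
```

The split works because the activation is element-wise: each node needs only its own slice of the hidden vector, so no data crosses nodes until the final sum.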

4. Experimental Results

The authors evaluated ARCLIGHT against llama.cpp on a 192-core machine (4 NUMA nodes, 48 Huawei Kunpeng-920 ARM cores per node) using the Qwen3-4B model (Q4_0 quantization).

Single NUMA Node: When threads are bound to a single node, ARCLIGHT slightly outperforms llama.cpp due to better memory locality, as llama.cpp relies on OS page distribution.
Multi NUMA Nodes (The Critical Test):
- When threads are distributed across 2 or 4 NUMA nodes, llama.cpp suffers from severe performance degradation due to cross-node memory access.
- ARCLIGHT with TP significantly outperforms llama.cpp.
- Throughput Gain: ARCLIGHT achieves up to 46% higher inference throughput compared to llama.cpp in multi-NUMA scenarios.
- Asynchronous Gain: The asynchronous execution strategy alone contributed an additional gain of approximately 5 tokens/second.

5. Key Contributions

Lightweight Inference Architecture: A modular, "hackable" framework that distills LLM inference to its core essentials, removing the bloat of traditional frameworks and making it easier for researchers to experiment with CPU-based deployment.
Optimization for Many-Core CPUs: A blueprint for multi-dimensional optimization that addresses the specific challenges of NUMA architectures through:
- NUMA-aware memory and thread management.
- Finely controlled Tensor Parallelism to eliminate cross-node memory access.
- Asynchronous subgraph execution to maximize hardware utilization.

6. Significance and Future Work

Significance: ARCLIGHT demonstrates that CPUs can be highly efficient for LLM inference if the architecture is explicitly designed to respect hardware topology (NUMA). It breaks the performance ceiling imposed by cross-node memory latency, making many-core CPUs viable for scalable, cost-effective LLM serving.
Limitations & Future Work:
- Currently evaluated only on ARM platforms; x86 support is planned for future work.
- The Scatter and Gather operators are preliminary; further optimization is needed to reduce memory overhead and improve parallel efficiency.
- The project is open-sourced to serve as an educational platform and development toolkit.