Exploiting Parallelism in a QPALM-based Solver for Optimal Control

Imagine you are the captain of a massive spaceship trying to navigate a complex asteroid field. You need to plot the perfect course for the next hour, second by second, to avoid crashes while using the least amount of fuel. This is a classic Optimal Control problem.

In the real world, this math happens in self-driving cars, drones, and robots. They have to make split-second decisions: "Should I turn left now? Should I brake?" To do this, they solve a giant math puzzle called a Quadratic Program (QP) over and over, many times per second.

This paper is about making the computer solve that puzzle much, much faster.

Here is the breakdown of their solution, using simple analogies:

1. The Problem: A Long Line of Workers

Imagine the spaceship's journey is broken down into 100 tiny time-steps (stages). To solve the navigation puzzle, a computer has to do a specific calculation for each of those 100 steps.

Traditionally, a computer might do this like a single worker in a factory:

Calculate Step 1.
Put the result down.
Calculate Step 2.
Put the result down.
...and so on until Step 100.

This is slow. If you have a modern computer with 8 powerful "brains" (cores) and a super-fast "assembly line" (SIMD vectorization), letting just one brain do all the work is a waste of money and power.

2. The Solution: The "Compact Storage" Trick (The Conveyor Belt)

The authors realized that the math for Step 1, Step 2, Step 3, etc., is almost identical. It's like doing the same recipe, but with slightly different ingredients.

They introduced a clever way to organize the data called "Compact Storage."

The Old Way (Naive): Imagine you have 100 boxes of ingredients. You put Box 1 on the table, do the work, clear it, put Box 2 on the table, do the work. You are constantly moving boxes around.
The New Way (Compact): Imagine you line up the ingredients from Box 1, Box 2, Box 3, and Box 4 side-by-side on a giant conveyor belt.
- Instead of doing "Box 1, then Box 2," your machine grabs the flour from all four boxes at once, then the sugar from all four boxes at once.
- This is called SIMD (Single Instruction, Multiple Data). It's like a chef chopping four onions simultaneously with a wide knife instead of chopping them one at a time.

By rearranging how the data sits in the computer's memory, they allow the processor to crush through multiple time-steps in a single heartbeat.

3. The Solution: The "Team of Workers" (Parallelization)

Even with the conveyor belt, if you have 1,000 time-steps, one conveyor belt might still be too slow.

The authors also used OpenMP, which is like hiring a team of workers instead of just one.

They split the 1,000 time-steps into 8 chunks.
They assigned one chunk to each of the 8 computer cores.
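Now, instead of one person doing the work for an hour, 8 people are doing it simultaneously. In the ideal case they finish in close to 1/8th of the time, although any step that must still be done in order keeps the real gain smaller.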

4. The Results: Speeding Up the Race

The authors tested their new "Super-Solver" (called QPALM-OCP) against the old standard solvers.

The Test: They used a classic physics problem: a chain of masses connected by springs (like a slinky). They made the slinky longer and longer (more masses = more complex math).
The Outcome:
- For a medium-sized problem, their new solver was 29 times faster than the old standard.
- For a problem with a specific simple structure, it was 65 times faster.
- They also tested it on a "quadruped" (four-legged robot) walking problem. Their solver was roughly 4 to 5 times faster.

Why Does This Matter?

In the world of robotics and self-driving cars, time is safety.

If a self-driving car takes 10 milliseconds to decide to brake, it might be too late.
If it takes 1 millisecond, it can react instantly.

By making the math solver anywhere from about 4 to 65 times faster, depending on the problem, this paper helps robots and cars think faster, react sooner, and operate more safely in the real world. The authors didn't invent a new type of math; they figured out how to organize the work so the computer's hardware can do it in parallel, like a well-oiled assembly line instead of a lonely worker.

Here is a detailed technical summary of the paper "Exploiting Parallelism in a QPALM-based Solver for Optimal Control."

1. Problem Statement

The paper addresses the computational challenges associated with solving Linear-Quadratic Optimal Control Problems (OCPs), which are fundamental to applications like Linear Model Predictive Control (MPC) and Moving Horizon Estimation (MHE).

Context: These problems often arise in real-time, embedded environments where solvers must be highly efficient.
Core Challenge: While the QPALM algorithm (an augmented Lagrangian-based solver) was recently adapted for OCPs (QPALM-OCP) to handle equality constraints directly and exploit block structures, the computational cost of the inner semismooth Newton solver remains a bottleneck.
Goal: To further reduce solver runtimes by exploiting the stage-wise structure of OCPs through parallelization and vectorization in a high-performance C++ implementation.

2. Methodology

The authors propose a two-level parallelization strategy applied to the QPALM-OCP algorithm, specifically targeting the solution of the linear systems within the inner semismooth Newton solver.

A. Algorithmic Foundation (QPALM-OCP)

The solver minimizes a piecewise quadratic augmented Lagrangian function subject to equality constraints. The core computational step involves solving a linear system defined by a generalized Hessian matrix $H_k(x)$.

Structure Exploitation: The Hessian matrix $H_k(x)$ is block-diagonal, where each block $H_{k,j}$ corresponds to a specific time stage $j$ in the control horizon.
Decomposition: The solution involves factorizing these independent blocks and then solving a reduced system involving the matrix $\Psi = M H^{-1} M^T$, which is block-tridiagonal.
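
To see why $\Psi$ comes out block-tridiagonal, here is a sketch under an assumed standard formulation. The symbols $z_j$, $F_j$, $E$ and the constraint ordering are illustrative assumptions, not taken from the paper. Write $H_j$ for the stage block $H_{k,j}$, collect the stage variables as $z_j = (x_j, u_j)$, and suppose the equality constraints are the dynamics $x_{j+1} = A_j x_j + B_j u_j$ (ignoring the initial-state constraint). Row block $j$ of $M$ then only touches stages $j$ and $j+1$:

$$
M = \begin{pmatrix} F_0 & -E & & \\ & F_1 & -E & \\ & & \ddots & \ddots \end{pmatrix},
\qquad F_j = \begin{pmatrix} A_j & B_j \end{pmatrix},
\qquad E = \begin{pmatrix} I & 0 \end{pmatrix}.
$$

Multiplying out $M H^{-1} M^T$ with block-diagonal $H$ gives

$$
\Psi_{j,j} = F_j H_j^{-1} F_j^T + E H_{j+1}^{-1} E^T,
\qquad
\Psi_{j,j+1} = \Psi_{j+1,j}^T = -E H_{j+1}^{-1} F_{j+1}^T,
\qquad
\Psi_{j,i} = 0 \ \text{ for } |i - j| > 1.
$$

Each block of $\Psi$ needs $H_j^{-1}$ from at most two neighbouring stages, which is consistent with the per-stage factorizations being independent while only the factorization of $\Psi$ itself couples the stages.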

B. Level 1: Vectorization (SIMD)

To leverage Single Instruction, Multiple Data (SIMD) capabilities of modern CPUs:

Compact Storage Format: The authors introduce a "compact" memory layout. Instead of storing matrices for each stage contiguously (naive format), matrices from different stages (e.g., $A_0, A_1$ ) are interleaved in memory.
Mechanism: This allows the CPU to load elements from multiple stages into vector registers simultaneously. For example, with a vector length of 2, the products $A_0 x_0$ and $A_1 x_1$ are computed together by the same vector instructions.
Implementation: The authors implemented custom linear algebra routines (based on the BLIS framework) using C++ templates and std::simd to handle these batched operations, avoiding the overhead found in standard libraries like Intel MKL for small matrix batches.
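
The compact layout described above can be sketched in a few lines of C++. This is a minimal illustration, not the paper's implementation: the type `CompactBatch`, the function `batched_matvec`, and the exact index formula are hypothetical, and the sketch relies on plain loops that a compiler may auto-vectorize rather than on explicit SIMD types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration type (not the paper's API): a batch of V equally
// sized matrices stored interleaved, in column-major order.
// Element (r, c) of matrix s lives at data[(c * rows + r) * V + s], so the
// V copies of each element sit next to each other in memory.
template <std::size_t V>
struct CompactBatch {
    std::size_t rows, cols;
    std::vector<double> data;

    CompactBatch(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c * V, 0.0) {}

    double &at(std::size_t s, std::size_t r, std::size_t c) {
        return data[(c * rows + r) * V + s];
    }
    double at(std::size_t s, std::size_t r, std::size_t c) const {
        return data[(c * rows + r) * V + s];
    }
};

// Computes y_s = A_s * x_s for all V stages in the batch at once.
// The innermost loop runs over the stage index s with unit stride, which is
// the access pattern that maps naturally onto the lanes of a vector register.
template <std::size_t V>
void batched_matvec(const CompactBatch<V> &A, const CompactBatch<V> &x,
                    CompactBatch<V> &y) {
    assert(A.cols == x.rows && A.rows == y.rows && x.cols == 1 && y.cols == 1);
    for (auto &v : y.data) v = 0.0;
    for (std::size_t c = 0; c < A.cols; ++c)
        for (std::size_t r = 0; r < A.rows; ++r)
            for (std::size_t s = 0; s < V; ++s)
                y.at(s, r, 0) += A.at(s, r, c) * x.at(s, c, 0);
}
```

With `V = 2`, the entries of $A_0$ and $A_1$ alternate in `data`, so one pass of the inner loop updates the same row of $A_0 x_0$ and $A_1 x_1$ from adjacent memory locations. In a naive layout the same two values would be a whole matrix apart.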

C. Level 2: Multi-threading (OpenMP)

To utilize multi-core hardware:

Stage Distribution: The control horizon $N$ is divided into blocks. These blocks are distributed across multiple physical CPU cores using OpenMP.
Parallelizable Operations: All operations up to the factorization of the block-tridiagonal matrix $\Psi$ are fully independent across stages and are parallelized.
Sequential Bottleneck: The recursive factorization of $\Psi$ remains sequential but benefits from parallel linear algebra operations within the small $n_x \times n_x$ blocks.
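
The stage-level parallelism described above can be illustrated with a small OpenMP loop. This is a sketch under stated assumptions rather than the authors' code: `StageBlock`, `cholesky_in_place`, and `factorize_all_stages` are hypothetical names, each stage block is assumed to be a small dense symmetric positive definite matrix, and a textbook unblocked Cholesky factorization stands in for whatever per-stage work the solver actually performs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-stage data: a dense n x n symmetric matrix, row-major.
struct StageBlock {
    std::size_t n;
    std::vector<double> a;
};

// Textbook in-place Cholesky factorization. Overwrites the lower triangle
// with L such that L * L^T equals the original matrix; the strict upper
// triangle is left untouched. Returns false if a pivot is not positive.
bool cholesky_in_place(StageBlock &blk) {
    const std::size_t n = blk.n;
    for (std::size_t j = 0; j < n; ++j) {
        double d = blk.a[j * n + j];
        for (std::size_t k = 0; k < j; ++k)
            d -= blk.a[j * n + k] * blk.a[j * n + k];
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        blk.a[j * n + j] = d;
        for (std::size_t i = j + 1; i < n; ++i) {
            double v = blk.a[i * n + j];
            for (std::size_t k = 0; k < j; ++k)
                v -= blk.a[i * n + k] * blk.a[j * n + k];
            blk.a[i * n + j] = v / d;
        }
    }
    return true;
}

// Factorizes every stage block. The iterations do not depend on each other,
// so OpenMP can hand contiguous chunks of stages to different threads.
// Returns the number of blocks whose factorization failed.
// Compile with OpenMP enabled (e.g. -fopenmp); without it the pragma is
// ignored and the loop simply runs serially.
int factorize_all_stages(std::vector<StageBlock> &stages) {
    int failures = 0;
    const long N = static_cast<long>(stages.size());
    #pragma omp parallel for schedule(static) reduction(+ : failures)
    for (long j = 0; j < N; ++j)
        if (!cholesky_in_place(stages[j])) ++failures;
    return failures;
}
```

The static schedule mirrors the idea of splitting the horizon into equal chunks, one per core. The subsequent recursive factorization of $\Psi$ is deliberately left out, since each of its steps depends on the previous one and cannot be distributed over stages this way.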

3. Key Contributions

Dual-Level Parallelization: The paper successfully integrates SIMD vectorization (via compact storage) and multi-threading (via OpenMP) into a single solver framework.
Custom Linear Algebra Routines: The authors developed optimized, custom BLAS-like routines specifically for the compact storage format, outperforming generic library calls for small, batched matrix operations.
Specialized Solver for OCPs: The implementation is tailored specifically to the block-diagonal and block-tridiagonal structures inherent in optimal control, rather than treating the problem as a generic Quadratic Program (QP).
Comprehensive Benchmarking: The work provides extensive comparisons against state-of-the-art solvers (QPALM, PIQP, OSQP, HPIPM) across various problem sizes and structures.

4. Results

The performance was evaluated on an octa-core Intel Core i7-11700 using standard benchmarks (Spring-Mass system) and the qpsolvers/mpc_qpbenchmark repository.

Spring-Mass Benchmark (Diagonal Structure):
- For a problem with 3275 primal variables, the dense QPALM-OCP was 29x faster than the standard dense QPALM and 19x faster than the pruned sparse QPALM.
- The diagonal QPALM-OCP (exploiting specific diagonal structure) achieved a 65x speedup over dense QPALM and 43x over pruned sparse QPALM.
Vectorization Impact:
- In single-threaded mode, vectorization provided a 2.3x speedup.
- Multi-threading (8 threads) provided further gains, though performance was eventually limited by cache bandwidth and the sequential factorization of $\Psi$.
Quadruped Locomotion Benchmarks (QUADCMPC):
- The dense QPALM-OCP significantly outperformed the sparse QPALM solver even on sparse problems (e.g., 5.1 ms vs 21.2 ms for QUADCMPC1).
- For very small problems, the overhead of threading was negligible, and QPALM-OCP remained competitive or superior (0.43 ms vs 0.46 ms).
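
The saturation of multi-threaded gains reported above is what Amdahl's law, a standard bound that this summary does not itself invoke, would lead one to expect. As a rough sketch, if a fraction $f$ of the single-threaded runtime parallelizes perfectly over $p$ threads and the remainder (such as the sequential factorization of $\Psi$) does not, the achievable speedup is

$$
S(p) = \frac{1}{(1 - f) + f / p}.
$$

For a purely hypothetical $f = 0.9$ and $p = 8$, this gives $S(8) = 1 / (0.1 + 0.1125) \approx 4.7$, well short of 8. The value of $f$ is an illustration only, not a measurement from the paper, and the bound ignores the cache-bandwidth limit mentioned above.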

5. Significance

This work demonstrates that hardware-aware optimization is critical for real-time control applications. By moving beyond generic sparse matrix solvers and explicitly designing the data layout and execution flow to match the stage-wise independence of OCPs and the vector capabilities of modern CPUs, the authors achieved order-of-magnitude improvements in solver speed.

Practical Impact: The results suggest that specialized solvers like QPALM-OCP can enable MPC on faster time scales or for larger horizons on embedded hardware, potentially replacing slower generic solvers.
Future Directions: The authors note that future work will focus on offline matrix packing optimization and implementing factorization update routines to avoid full refactorizations when constraints change slightly, further enhancing real-time performance.