GPU-friendly and Linearly Convergent First-order Methods for Certifying Optimal k-sparse GLMs

This paper proposes a unified, GPU-accelerated proximal framework with a duality-gap-based restart scheme and specialized log-linear routines, achieving linear convergence and significantly improving the scalability of certifying optimal k-sparse generalized linear models.

Jiachang Liu, Andrea Lodi, Soroosh Shafiee

Published 2026-03-03

Imagine you are a detective trying to find the perfect suspect in a lineup of 100,000 people. You know the culprit is one of them, but you also know a crucial fact: the culprit only has 10 distinct characteristics (like wearing a red hat, having a scar, etc.). You need to find the exact combination of 10 traits that explains the crime scene perfectly.

This is what sparse Generalized Linear Models (GLMs) do: find the best model using only a few key features. The problem is, checking every possible combination of 10 features out of 100,000 is computationally infeasible. It's like trying to find a specific grain of sand on a beach by checking every single grain one by one.
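To see just how hopeless brute force is, here is a quick back-of-the-envelope count (plain Python, nothing from the paper):

```python
import math

# Number of ways to choose 10 features out of 100,000
n_subsets = math.comb(100_000, 10)
print(f"{n_subsets:.2e}")  # on the order of 10^43 candidate subsets
```

Even checking a trillion subsets per second, exhausting that search space would take longer than the age of the universe, which is why pruning is essential.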

This paper introduces a new, super-fast way to prove you've found the absolute best solution, not just a "pretty good" guess. Here's how they did it, broken down into simple concepts:

1. The Problem: The "Perfect" vs. The "Good Enough"

Usually, when computers try to solve this, they use shortcuts (heuristics) or "relaxations" (loosening the rules to make the math easier).

  • The Old Way: Relaxations are like trying to solve a puzzle by looking at a blurry, low-resolution photo. You can guess the picture, but you can never be 100% sure it's the right one. To get the provably best answer, you have to use a method called Branch-and-Bound (BnB). This is a tree search that splits the problem into smaller and smaller pieces.
  • The Bottleneck: At every branch of the tree, the computer has to solve a difficult math problem to prove, "Okay, this path cannot contain the best solution, so let's cut it off." The old way of doing this was like trying to solve a Rubik's cube with a spoon—it worked, but it was incredibly slow and couldn't use modern super-computers (GPUs) effectively.
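The pruning idea at the heart of BnB can be sketched on a toy problem. The code below is purely illustrative and is not the paper's algorithm: it finds the cheapest pair of items, using the partial cost as a lower bound to cut whole branches off early (valid here because costs are nonnegative):

```python
def best_pair_bnb(costs):
    """Toy branch-and-bound: find the cheapest pair of items.

    A node is (next index, chosen items, partial cost). Because all
    costs are nonnegative, the partial cost is a valid lower bound on
    any completion, so we can prune a subtree the moment it can no
    longer beat the best solution found so far (the "incumbent").
    """
    best = float("inf")
    stack = [(0, (), 0.0)]
    while stack:
        i, chosen, cost = stack.pop()
        if cost >= best:        # bound test: this subtree cannot win
            continue            # -> cut the branch off
        if len(chosen) == 2:    # a complete solution (a leaf)
            best = cost
            continue
        if i == len(costs):     # ran out of items without a full pair
            continue
        stack.append((i + 1, chosen, cost))                    # exclude item i
        stack.append((i + 1, chosen + (i,), cost + costs[i]))  # include item i
    return best
```

The expensive part in the real problem is exactly that bound test: each node requires solving an optimization problem, which is the bottleneck the paper attacks.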

2. The Solution: A New "GPS" for the Search

The authors built a new engine to navigate this search tree. They didn't just make the old engine faster; they redesigned the whole vehicle.

A. The "Smooth" Path (Composite Reformulation)

Think of the math problem as a rugged, rocky mountain. The old methods tried to climb it by taking tiny, careful steps, often getting stuck in valleys.
The authors realized they could reshape the mountain into a smooth, rolling hill. By reformulating the problem, they turned a jagged, difficult shape into something a computer can roll down very quickly. This is called a "composite optimization" problem.
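A minimal sketch of what a composite ("smooth loss + simple nonsmooth term") solver looks like, using the classic proximal gradient method on a lasso-style objective. This illustrates the general technique, not the paper's specific reformulation:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the simple nonsmooth part)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, steps=500):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1.

    The objective splits into a smooth part (the quadratic loss, handled
    by gradient steps) and a nonsmooth part (the l1 penalty, handled by
    its cheap proximal operator) -- the composite structure the text
    describes as "reshaping the mountain".
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - b)           # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

Each iteration is one gradient step on the smooth piece followed by one closed-form prox step on the nonsmooth piece, which is what makes the reshaped problem "easy to roll down".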

B. The "Restart" Trick (Linear Convergence)

Imagine you are running down a hill. Sometimes, you get a bit of momentum and overshoot the bottom, then have to run back up a little. This "wobble" slows you down.

  • The Old Trick: Some runners just keep running, hoping they eventually stop wobbling.
  • The New Trick: The authors designed a Restart Scheme. They put a sensor on the runner that measures the duality gap: the distance between where they are and a certified bound on the bottom of the hill. As soon as the runner stops closing that gap fast enough, the sensor yells, "Reset!" The runner keeps their current position but drops all momentum and begins a fresh sprint from there.
  • The Result: Instead of wobbling for hours, the runner zooms straight to the bottom. In math terms, this turns a "slow" method into a linearly convergent one, meaning the error drops by a fixed percentage with every step, guaranteeing a super-fast finish.
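The mechanics of the restart can be sketched on top of an accelerated proximal method. For simplicity, this uses the well-known gradient-based restart test in place of the paper's duality-gap test; `grad`, `prox`, and `L` are the smooth gradient, proximal operator, and Lipschitz constant supplied by the caller (all names here are illustrative):

```python
import numpy as np

def accelerated_prox_with_restart(grad, prox, L, x0, steps=50):
    """Accelerated proximal gradient with a restart rule (illustrative).

    Momentum (via t and y) gives the method its speed but also its
    "wobble". When the test detects that momentum is no longer helping,
    we wipe it (t = 1) and restart from the current point with fresh
    energy -- the mechanism behind the restored linear convergence.
    """
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(steps):
        x_new = prox(y - grad(y) / L, 1.0 / L)
        if np.dot(y - x_new, x_new - x) > 0:   # progress stalled
            x, y, t = x_new, x_new.copy(), 1.0  # kill momentum, restart here
            continue
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

On a simple quadratic the method snaps to the minimizer; the restart test fires whenever the momentum direction starts pointing "uphill" relative to the latest step.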

C. The "GPU" Supercharger

Modern computers have GPUs (Graphics Processing Units), which are like having 10,000 tiny workers all doing simple tasks at the same time.

  • The Problem: The old math methods were like a single master chef trying to chop 10,000 onions one by one. You can't use the 10,000 workers because the chef has to do everything in a specific order.
  • The Fix: The authors designed their new math routine so that the "chopping" (matrix-vector multiplication) is the main task. This is perfect for GPUs. Now, instead of one chef, you have 10,000 workers chopping onions simultaneously.
  • The Speedup: They created special, custom routines (the paper's log-linear subroutines) that no off-the-shelf solver provided. Instead of using a generic, heavy-duty tool (like a sledgehammer) for a delicate job, they built a laser scalpel. This let them tap the GPU's full power, making the calculations 10 to 100 times faster.
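To see why matrix-vector products are the key, here is a sketch of a GLM (logistic) gradient written so that two matvecs and an elementwise map do essentially all the work. NumPy stands in for a GPU array library here; the function and its structure are illustrative, not the paper's kernels:

```python
import numpy as np

def logistic_gradient(X, y, beta):
    """Gradient of the logistic loss, built entirely from matrix-vector
    products and elementwise operations -- the workloads GPUs parallelize
    best. Swapping np for a GPU array library (e.g. CuPy) moves this to
    the GPU essentially unchanged."""
    z = X @ beta                  # one matvec: all n predictions at once
    p = 1.0 / (1.0 + np.exp(-z))  # elementwise sigmoid, also fully parallel
    return X.T @ (p - y)          # one transposed matvec aggregates the rest
```

There is no sequential, one-sample-at-a-time loop anywhere: every "onion" is chopped simultaneously, which is exactly the property that lets the per-node bound computations saturate a GPU.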

3. The Real-World Impact

Why does this matter?

  • Healthcare: Imagine a doctor trying to diagnose a disease based on 50,000 genetic markers. They need to know exactly which 5 markers cause the disease, with 100% certainty. The old methods might take days or give a "maybe" answer. This new method can find the proven answer in minutes.
  • Finance: Banks need to detect fraud. If they can prove a transaction is definitely fraudulent (or definitely safe) much faster, they can stop crimes in real-time.

The Bottom Line

The authors took a problem that was too hard to solve perfectly on a large scale and:

  1. Smoothed out the math so it's easier to solve.
  2. Invented a "Restart" button that guarantees the solution is found quickly without getting stuck.
  3. Built custom tools that let modern super-computers (GPUs) do the heavy lifting in parallel.

The result? They can now certify the "perfect" solution for massive problems in a fraction of the time it used to take, turning a "maybe" into a "definitely." It's like upgrading from a bicycle to a rocket ship for solving some of the trickiest puzzles in data science.
