Imagine you have a giant, incredibly complex map of a city. This map was drawn by a brilliant cartographer (a Neural Network) who studied millions of data points to understand the terrain perfectly. Now, you need to use this map to solve a specific problem: either finding the safest route that avoids all traffic jams (Network Verification) or finding the highest peak in the city (Function Maximization).
The problem? The map is so huge and detailed that trying to navigate it with a standard compass and ruler (a mathematical solver) takes forever. It's like trying to find a specific street in a city with 10 million streets by checking every single one.
The Paper's Big Idea: "The Sketchy Shortcut"
The authors of this paper propose a clever trick. Instead of using the giant, perfect map, they suggest tearing out most of the streets to create a much smaller, "sparse" version of the map.
Here is the surprising twist: They don't even redraw the remaining streets to make them perfect again.
Usually, if you tear up a map, you'd try to fix it by studying the terrain again (a process called "finetuning"). But the authors found that leaving the map messy and incomplete actually helps you solve the problem faster.
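In network terms, "tearing out the streets" means pruning: zeroing out most of the weights and simply leaving the survivors untouched (no finetuning). Here is a minimal sketch of one common recipe, magnitude pruning, in plain Python. The function name and the 90% sparsity level are illustrative, not the paper's exact procedure:

```python
import random

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity`
    fraction of them become zero. Crucially: no retraining afterwards."""
    ranked = sorted(weights, key=abs)
    n_zero = int(len(weights) * sparsity)
    cutoff = abs(ranked[n_zero - 1]) if n_zero > 0 else -1.0
    # Keep only weights whose magnitude is strictly above the cutoff;
    # everything else is a "street torn out of the map".
    return [w if abs(w) > cutoff else 0.0 for w in weights]

random.seed(0)
W = [random.gauss(0, 1) for _ in range(1000)]  # toy weight vector
W_sparse = magnitude_prune(W)
print(sum(w == 0.0 for w in W_sparse) / len(W_sparse))  # 0.9
```

The sparse vector is what gets handed to the solver: most entries are exactly zero, which is precisely what makes the solver's job smaller.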
The Analogy: The "Rough Draft" vs. The "Polished Novel"
Think of the original neural network as a polished, best-selling novel. It's perfect, but it's heavy and takes a long time to read if you are trying to find a specific plot twist.
The authors' method is like taking that novel, ripping out 90% of the pages, and leaving the remaining pages in a rough, unedited state.
- The Old Way: Rip out pages, then spend hours rewriting the remaining pages to make sure the story still makes sense (Finetuning).
- The New Way: Rip out pages and just hand the rough, messy stack to your friend.
Why does the messy stack work better?
Because the friend (the optimization solver) is overwhelmed by the perfect novel. They get stuck in the details. But with the rough, sparse stack, the friend can flip through it quickly, spot the general direction, and find a good solution much faster. Even though the story is incomplete, the key clues are still there.
Two Main Games They Played
The researchers tested this "Rough Draft" strategy in two different games:
1. The "Hacker Hunt" (Network Verification)
- The Goal: Can a hacker trick the AI? They want to find a tiny change to an input (like adding a few pixels to a picture of a cat) that makes the AI think it's a dog.
- The Result: Using the "Rough Draft" (the pruned, un-finetuned network) was much faster at finding these tricks. Even though the rough draft was bad at recognizing cats (it had low accuracy), it was surprisingly good at revealing where the AI was vulnerable.
- The Surprise: Trying to "fix" the rough draft by retraining it (finetuning) actually made the process slower and less effective for this specific goal.
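To make the "hacker hunt" concrete: verification asks whether any tiny perturbation of an input can flip the model's answer. The paper hands this question to an exact optimization solver; the toy sketch below uses blind random search on a stand-in classifier, just to show what is being hunted. All names and the perturbation budget `eps` are illustrative assumptions, not the paper's setup:

```python
import random

def toy_classifier(x):
    """Stand-in for a network: says 'cat' if the feature sum is positive."""
    return "cat" if sum(x) > 0 else "dog"

def find_adversarial(x, eps=0.3, tries=2000, seed=1):
    """Random search for a small perturbation (each feature moved by
    at most eps) that flips the label -- the 'hacker hunt'."""
    rng = random.Random(seed)
    original = toy_classifier(x)
    for _ in range(tries):
        delta = [rng.uniform(-eps, eps) for _ in x]
        x_adv = [xi + di for xi, di in zip(x, delta)]
        if toy_classifier(x_adv) != original:
            return x_adv  # found an input that fools the model
    return None  # no attack found within the search budget

cat = [0.2, 0.1, 0.05, 0.15]  # sum = 0.5, so the label is "cat"
adv = find_adversarial(cat)
print(adv is not None)  # True: a small nudge flips "cat" to "dog"
```

The paper's point is about the cost of each such query: a sparse network is far cheaper for a solver to reason about, so the hunt finishes sooner.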
2. The "Mountain Climber" (Function Maximization)
- The Goal: Find the absolute highest point the AI can predict.
- The Result: This was a bit trickier. The "Rough Draft" didn't always find the true global peak, but it found very high peaks much faster than climbing the giant mountain directly. It was like using a drone to spot a high ridge quickly, rather than hiking every inch of the mountain.
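The "mountain climber" game can be sketched the same way. The paper uses an exact solver to maximize the network's output; the toy version below climbs a stand-in 1-D landscape with random local search, purely to illustrate what "function maximization" is asking for (the landscape and all names are invented for this example):

```python
import math
import random

def network_output(x):
    """Stand-in for a trained model's scalar prediction:
    a bumpy 1-D landscape whose highest peak sits near x = 1.7."""
    return -(x - 2) ** 2 + 0.3 * math.sin(5 * x)

def hill_climb(x=0.0, step=0.1, iters=5000, seed=2):
    """Random local search for the highest point the model predicts.
    The paper hands this job to an exact solver; this toy version
    only shows the objective being maximized."""
    rng = random.Random(seed)
    best = network_output(x)
    for _ in range(iters):
        candidate = x + rng.uniform(-step, step)
        value = network_output(candidate)
        if value > best:  # only accept uphill moves
            x, best = candidate, value
    return x, best

x_star, peak = hill_climb()
print(round(x_star, 2), round(peak, 2))
```

As with verification, the sparse network makes each step of this climb cheaper, which is why it finds high ridges quickly even when it misses the exact summit.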
The Key Takeaways for Everyday Life
- Perfection is the enemy of speed: Sometimes, a "good enough" model that is simple and sparse is better than a "perfect" model that is too heavy to use.
- Don't over-fix your shortcuts: If you create a simplified version of a complex problem, don't waste time trying to make the simplified version perfect again. The imperfections actually help the computer solve the problem faster.
- Less is more: By removing 90% of the connections in a neural network, you don't lose the ability to solve the problem; you just remove the "noise" that was slowing you down.
In a Nutshell
The paper tells us that when we are trying to solve hard math problems using AI, we shouldn't be afraid to use a crude, messy, and incomplete version of the AI. By stripping away the excess weight and not bothering to polish it back up, we can solve problems faster and often find better solutions than if we tried to use the giant, perfect original.
It's the difference between trying to navigate a city with a 1,000-page encyclopedia versus a 10-page sketch on a napkin. Sometimes, the napkin gets you where you need to go much quicker.