PASS: Certified Subset Repair for Classical and Quantum Pairwise Constrained Clustering

Imagine you are organizing a massive party with hundreds of guests. Your goal is to seat everyone at tables so that people who get along sit together, and the overall "awkwardness" of the room is minimized. This is essentially what clustering does in data science: it groups similar items together.

However, real life (and real data) comes with rules. Some guests must sit together (Must-Link), and some guests cannot sit at the same table under any circumstances (Cannot-Link).

The paper introduces a new method called PASS, built around certified subset repair, to solve this seating chart problem even when the party is huge and the rules are strict. Here is how it works, broken down into simple concepts:

1. The Problem: The "Impossible" Seating Chart

Standard party planners (algorithms) try to move everyone around to find the perfect arrangement. But when you have thousands of guests and strict "no-sitting-together" rules, the math gets incredibly hard.

The Scale Issue: If you try to calculate the best seat for every single guest at once, the computer gets overwhelmed. It's like trying to solve a Sudoku puzzle where the grid is the size of a football field.
The Conflict Issue: Sometimes the rules contradict each other. If Guest A must sit with Guest B, but Guest B cannot sit with Guest C, and Guest C must sit with Guest A, you have a broken rule set. The computer gets stuck trying to fix it.

2. The PASS Solution: The "Focus Group" Strategy

Instead of trying to fix the seating for the whole party at once, PASS uses a clever trick: it only focuses on the trouble spots.

Think of it like a teacher managing a chaotic classroom. Instead of yelling at the whole room to be quiet, the teacher identifies the three students who are arguing and the two students who are confused about where to sit. The teacher tells everyone else, "You guys stay exactly where you are; you're doing fine."

Step 1: The "Must-Link" Collapse (Gluing Friends Together)
If two guests must sit together, PASS treats them as a single "super-guest." It glues them together into one weighted unit. This shrinks the problem size immediately.

Step 2: The "Working Set" (The Focus Group)
PASS identifies a tiny subset of guests who are either:

Ambiguous: They are on the fence about which table to sit at.
Violators: They are currently sitting with someone they aren't supposed to be with.

It freezes the rest of the party. The computer only does the heavy math on this tiny "working set" (maybe 5% of the guests). This makes the problem small enough for even a quantum computer to handle.

Step 3: The "Certified Repair" (The Safety Net)
This is the paper's biggest innovation. When the computer rearranges the "working set," it needs to make sure it didn't accidentally break a rule with the guests who were frozen.

PASS uses a mathematical concept called List Coloring. Imagine every guest in the working set has a list of "allowed tables," determined by where their frozen neighbors are already sitting.

The Certificate: The algorithm doesn't just guess; it produces a "certificate" (a proof) that says, "I have checked every rule, and this new arrangement is 100% valid."
If it can't find a valid arrangement, it doesn't just crash; it reports which rules remain violated and the evidence that they cannot be satisfied.

3. The Quantum Twist: The "Super-Computer" Shortcut

The paper also tests this method on Quantum Computers (specifically NISQ devices, which are the noisy, early-stage quantum computers we have today).

The Bottleneck: Quantum computers are great at solving complex puzzles, but they have very few "qubits" (memory slots). Trying to map a party of 10,000 people onto a quantum chip is impossible because there aren't enough slots.
The PASS Fix: Because PASS shrinks the problem down to just the "troublemakers" (the working set), it fits the puzzle onto the quantum chip.
The Result: The quantum computer solves the tiny, hard part of the puzzle, and the classical computer handles the rest. This hybrid approach finds better solutions faster than trying to force the whole problem onto the quantum chip.

Summary Analogy

Imagine you are trying to untangle a giant knot of headphones.

Old Methods: You pull on the whole knot at once. It gets tighter, and you get frustrated.
PASS: You find the one specific loop that is causing the tangle. You hold the rest of the headphones still. You focus all your energy on untying just that one loop. Once that loop is free, the whole knot falls apart, and PASS gives you a "certificate" confirming nothing else got tangled. If that loop cannot be untied, PASS shows you what is blocking it, so you don't waste time trying.

Why does this matter?
PASS allows us to organize massive datasets (like millions of medical records or financial transactions) with strict privacy or safety rules, using both standard computers and the emerging power of quantum computing, without getting stuck in the math.

Here is a detailed technical summary of the paper "PASS: Certified Subset Repair for Classical and Quantum Pairwise Constrained Clustering."

1. Problem Definition

The paper addresses Pairwise-Constrained $k$-means Clustering, a semi-supervised learning task where the goal is to minimize the Sum of Squared Errors (SSE) while satisfying two types of constraints:

Must-Link (ML): Two samples must belong to the same cluster.
Cannot-Link (CL): Two samples must belong to different clusters.

Challenges:

Computational Complexity: Constrained $k$-means is NP-hard. Standard $k$-means updates (Lloyd's algorithm) do not guarantee feasibility, and exact methods (e.g., Integer Linear Programming) do not scale to large datasets or dense constraint graphs.
Infeasibility: Constraint sets can be inconsistent (e.g., an odd cycle of CL constraints when $K = 2$, or a CL pair whose endpoints are joined by a chain of ML constraints), in which case no feasible solution exists.
Quantum Limitations: Encoding pairwise constraints into Quantum Annealing or QAOA (Quantum Approximate Optimization Algorithm) requires mapping dense logical couplings to sparse hardware topologies. This results in prohibitive overhead ($O(N \times K)$ variables) for current Noisy Intermediate-Scale Quantum (NISQ) devices.
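One kind of inconsistency is cheap to detect before any clustering is attempted: a CL pair whose endpoints are forced together by ML constraints. The sketch below is an illustration rather than the paper's procedure; the function name `find_ml_cl_conflicts` and the integer point ids are assumptions. It only catches ML/CL clashes. Deciding whether the CL constraints alone can be satisfied with $K$ clusters is a graph-coloring question and is hard in general.

```python
from collections import defaultdict, deque

def find_ml_cl_conflicts(n, must_links, cannot_links):
    """Return the cannot-link pairs whose endpoints are forced together by must-links."""
    graph = defaultdict(list)
    for i, j in must_links:
        graph[i].append(j)
        graph[j].append(i)

    # Label each point with the id of its must-link connected component (BFS).
    component = [-1] * n
    for start in range(n):
        if component[start] != -1:
            continue
        component[start] = start
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if component[v] == -1:
                    component[v] = start
                    queue.append(v)

    # A cannot-link inside one must-link component can never be satisfied.
    return [(i, j) for i, j in cannot_links if component[i] == component[j]]
```

With guests A, B, C numbered 0, 1, 2, the rule set "A with B, C with A, B not with C" is reported as a conflict on the pair (1, 2).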

2. Methodology: The PASS Framework

The authors propose PASS (Pairwise constrained Assignment via Subset Selection), a scalable framework that restricts optimization to a small "working subset" of data points while maintaining global feasibility guarantees. The framework operates in four phases:

Phase 1: Must-Link Collapse

Mechanism: All samples connected by ML constraints are contracted into single weighted pseudo-points.
Effect: This reduces the problem size and removes ML constraints from the optimization formulation, converting the problem into a weighted minimum sum-of-squares clustering (MSSC) problem with only CL constraints. The objective function is preserved up to an additive constant.
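A minimal sketch of the collapse step, assuming points are given as coordinate tuples and ML constraints as index pairs. The names `collapse_must_links` and `sq_dist` are illustrative, not taken from the paper. The returned `constant` is the scatter of each group around its own mean. For any cluster center, the cost of a group equals its weight times the squared distance from the group mean to that center, plus this constant, which is why the objective is preserved up to an additive term.

```python
from collections import defaultdict

def collapse_must_links(points, must_links):
    """Contract ML-connected points into weighted pseudo-points.

    Returns (centroids, weights, component_of, constant), where `constant`
    is the within-group scatter that no cluster assignment can change.
    """
    n = len(points)
    dim = len(points[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union the endpoints of every must-link.
    for i, j in must_links:
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)

    centroids, weights, component_of = [], [], [0] * n
    constant = 0.0
    for c, members in enumerate(groups.values()):
        mean = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
        centroids.append(mean)
        weights.append(len(members))
        for i in members:
            component_of[i] = c
            constant += sq_dist(points[i], mean)
    return centroids, weights, component_of, constant

def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

For three points on a line at 0, 2, and 10 with a must-link between the first two, the collapse yields a pseudo-point at 1 with weight 2, a singleton at 10, and a constant of 2.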

Phase 2: Working-Set Selection

Instead of optimizing over all $N$ points, PASS selects a small subset $S \subset \hat{S}$ (where $\hat{S}$ is the collapsed dataset) to re-optimize. Points outside $S$ have their assignments "frozen." Two selection strategies are proposed:

PASS-CA (Constraint-Aware): Selects points involved in CL violations and points near decision boundaries (low margin to the second-best cluster).
PASS-IG (Information Geometric): Selects points with high Fisher-Rao ambiguity scores (high uncertainty in soft assignments) and all CL violation endpoints.

Guarantee: The selection ensures that all current CL violations are contained within $S$, allowing the rest of the dataset to remain fixed without introducing new violations.
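The constraint-aware strategy can be sketched roughly as follows. The margin definition (gap between the nearest and second-nearest centroid in squared distance) and the `budget` parameter are assumptions made for illustration; the paper's exact scoring may differ. Violation endpoints are always included, even if they exceed the budget, which mirrors the guarantee above.

```python
def select_working_set(points, labels, centroids, cannot_links, budget):
    """Return indices to re-optimize: all CL violators plus the lowest-margin points."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Endpoints of every currently violated cannot-link are always included.
    selected = set()
    for i, j in cannot_links:
        if labels[i] == labels[j]:
            selected.update((i, j))

    # Margin = gap between the best and second-best centroid distance
    # (requires at least two centroids).
    margins = []
    for i, p in enumerate(points):
        d = sorted(sq_dist(p, c) for c in centroids)
        margins.append((d[1] - d[0], i))

    # Fill the remaining budget with the most ambiguous points.
    for _, i in sorted(margins):
        if len(selected) >= budget:
            break
        selected.add(i)
    return sorted(selected)
```

Points outside the returned set keep their current labels and act as frozen neighbors in the next phase.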

Phase 3: Restricted Optimization

Classical: The restricted assignment problem on $S$ is solved as a 0-1 Integer Linear Program (ILP). The objective minimizes the incremental SSE relative to the current centroids, subject to CL constraints within $S$ and compatibility with frozen neighbors outside $S$.
Quantum (q-PCKMeans): The restricted problem is formulated as a QUBO (Quadratic Unconstrained Binary Optimization).
- Variable Reduction: Variables scale with $O(|S| \times K)$ rather than $O(N \times K)$, making it viable for NISQ devices.
- Mixer Design: Uses a one-hot preserving XY mixer to ensure the quantum state remains within the feasible subspace of valid cluster assignments (one cluster per point) during evolution.
- Penalties: Hard constraints are encoded as penalty terms in the Hamiltonian.
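To make the restricted formulation concrete, here is a hedged sketch of a penalty-based QUBO over binary variables $x_{i,c}$ (working-set point $i$ takes cluster $c$). It is not the authors' encoding. In particular, it enforces one-hot assignments with a quadratic penalty so that a generic QUBO solver can be used, whereas the XY mixer described above keeps the state one-hot by construction and would make that term unnecessary. The names `build_qubo`, `energy`, and `brute_force`, and the single shared `penalty` weight, are assumptions. The constant from expanding the one-hot penalty is dropped, so energies are shifted by `n * penalty`.

```python
from itertools import product

def build_qubo(cost, cl_inside, frozen_forbidden, penalty):
    """Build a QUBO dict {(u, v): coeff} over variables u = (point, cluster).

    cost[i][c]          : incremental SSE of assigning working-set point i to cluster c
    cl_inside           : cannot-link pairs (i, j) with both endpoints in the working set
    frozen_forbidden[i] : clusters already taken by frozen CL neighbours of i
    penalty             : weight of every constraint term (assumed larger than any cost)
    """
    n, k = len(cost), len(cost[0])
    Q = {}

    def add(u, v, val):
        key = (u, v) if u <= v else (v, u)
        Q[key] = Q.get(key, 0.0) + val

    for i in range(n):
        for c in range(k):
            # Linear cost, plus the -penalty term from expanding (sum_c x_ic - 1)^2.
            add((i, c), (i, c), cost[i][c] - penalty)
            for c2 in range(c + 1, k):
                add((i, c), (i, c2), 2.0 * penalty)  # two clusters for one point
        for c in frozen_forbidden.get(i, ()):
            add((i, c), (i, c), penalty)  # clashes with a frozen neighbour
    for i, j in cl_inside:
        for c in range(k):
            add((i, c), (j, c), penalty)  # both endpoints in the same cluster
    return Q

def energy(Q, x):
    """Evaluate the QUBO for a dict x mapping (point, cluster) -> 0 or 1."""
    return sum(val * x[u] * x[v] for (u, v), val in Q.items())

def brute_force(Q, n, k):
    """Exhaustive minimiser, only meant for checking tiny instances."""
    variables = [(i, c) for i in range(n) for c in range(k)]
    best = None
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        e = energy(Q, x)
        if best is None or e < best[0]:
            best = (e, x)
    labels = [[c for c in range(k) if best[1][(i, c)]] for i in range(n)]
    return best[0], labels
```

The number of binary variables is `n * k` with `n = |S|`, which is the reduction from $O(N \times K)$ to $O(|S| \times K)$ noted above. The brute-force routine stands in for an annealer or QAOA run and is usable only at toy sizes.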

Phase 4: Certified Repair

This is a core theoretical contribution. When the restricted optimization yields a solution, PASS verifies if the global assignment is feasible.

List-Coloring Formulation: The repair problem is modeled as a list-coloring problem on the induced constraint subgraph $G[S]$. Each node $i \in S$ has a list of allowed colors (clusters) based on the frozen labels of its neighbors outside $S$.
Local Slack Certificate (Theorem 4.2): The authors derive a sufficient condition for repairability:
$\min_{i \in S} |L_i(a, S)| \geq \text{deg}(G[S]) + 1$
where $|L_i(a, S)|$ is the size of the allowed label list of node $i$ and $\text{deg}(G[S])$ is the degeneracy of the graph $G[S]$ (not a vertex degree).
Outcome: If the condition holds, a valid repair is constructed via greedy list coloring, and a verifiable witness (certificate) is returned. If the condition fails or the frozen assignment is infeasible, the system reports residual violations or obstruction evidence.
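The argument behind the slack condition can be sketched in code. Remove a minimum-degree node repeatedly to obtain a degeneracy ordering, then color in reverse order: each node then sees at most as many already-colored neighbors as the degeneracy, so a list with one more entry than the degeneracy never runs out. The names `degeneracy_order` and `certified_repair`, and the dictionary-based graph format, are assumptions for illustration and not the paper's interface. The adjacency must be symmetric.

```python
def degeneracy_order(adj):
    """Return (removal order, degeneracy) by repeatedly removing a minimum-degree node."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    order, degeneracy = [], 0
    while remaining:
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        degeneracy = max(degeneracy, len(remaining[v]))
        order.append(v)
        for u in remaining.pop(v):
            remaining[u].discard(v)
    return order, degeneracy

def certified_repair(adj, lists):
    """Greedy list coloring of G[S]; lists[v] holds the clusters still allowed for v.

    Returns (labels, certified). `certified` is True when every list has at
    least degeneracy + 1 entries, which guarantees the greedy pass succeeds.
    `labels` is None if the greedy pass gets stuck (possible only without slack).
    """
    order, degeneracy = degeneracy_order(adj)
    certified = all(len(lists[v]) >= degeneracy + 1 for v in adj)
    labels = {}
    # Reverse removal order: each node has at most `degeneracy` colored neighbours.
    for v in reversed(order):
        used = {labels[u] for u in adj[v] if u in labels}
        free = sorted(set(lists[v]) - used)
        if not free:
            return None, certified
        labels[v] = free[0]
    return labels, certified
```

On a triangle (degeneracy 2) with three allowed clusters per node, the condition holds and a valid coloring is returned. With only two allowed clusters per node the condition fails, and this greedy pass gets stuck. The condition is sufficient but not necessary, so the greedy pass can still succeed on other instances where it fails.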

3. Key Contributions

Certified Subset Repair: A novel framework that treats constrained clustering updates as a repair problem. It provides a checkable certificate (based on list coloring and graph degeneracy) that guarantees feasibility under restricted updates.
Scalability via Reduction: By collapsing ML components and selecting a small working set $S$ , PASS reduces the problem size significantly, enabling solutions for datasets with millions of points where exact baselines fail.
Quantum-Hybrid Pipeline: The first systematic approach to applying quantum optimization (QAOA) to pairwise-constrained clustering. By reducing the problem to a small subset, it bypasses the qubit bottleneck of full encodings, allowing hybrid evaluation on simulated NISQ hardware.
Theoretical Guarantees: Formal proofs for cost equivalence after ML collapse, feasibility preservation across subset boundaries, and sufficient conditions for repairability.

4. Experimental Results

The paper evaluates PASS on 12 datasets (ranging from 150 to 4.2 million points) and compares it against state-of-the-art baselines (COP-k-means, BLPKM-CC, PCCC) and quantum baselines (CP-QAOA).

Classical Performance:
- Scalability: PASS solved instances with more than 4 million points within a 1-hour time limit, whereas strong baselines (like PCCC) often timed out or failed to return a solution.
- Quality: PASS achieved SSE comparable to PCCC and COP-k-means on smaller datasets, often with lower runtime.
- Feasibility: PASS-IG consistently achieved zero constraint violations across benchmarks, whereas other methods often left residual violations.
Quantum Performance (Simulation):
- q-PCKMeans vs. CP-QAOA: The reduced subset approach (q-PCKMeans) significantly outperformed the full-encoding CP-QAOA.
- SSE Gains: q-PCKMeans achieved 6–84% lower SSE compared to CP-QAOA under simulation protocols (QAOA depth $p = 1$, 2048 shots).
- Feasibility: The reduced variable count allowed q-PCKMeans to find feasible solutions with near-zero violation rates, whereas full encodings often failed due to noise and limited circuit depth.

5. Significance

Bridging Classical and Quantum: PASS provides a practical pathway for using NISQ devices for constrained clustering. The simulated results suggest that problem reduction is a prerequisite for applying near-term quantum optimization to this task, as current hardware cannot handle full-scale encodings.
Trustworthy AI: The "Certified Repair" mechanism addresses a critical gap in constrained optimization: the lack of guarantees. By providing a verifiable certificate of feasibility, PASS ensures that solutions are not just approximate but mathematically verified to satisfy constraints (or explicitly report why they cannot).
Handling Inconsistency: The framework explicitly handles inconsistent constraint sets, distinguishing between "no solution exists" and "our method couldn't find one," providing obstruction evidence rather than silent failure.

In summary, PASS offers a robust, scalable, and theoretically grounded solution for pairwise-constrained clustering, effectively leveraging subset selection to enable both high-performance classical solvers and viable quantum-hybrid workflows.