Algebraic Structure Discovery for Real World Combinatorial Optimisation Problems: A General Framework from Abstract Algebra to Quotient Space Learning

Min Sun (F. Hoffmann-La Roche AG, Roche Pharma Research and Early Development), Federica Storti (F. Hoffmann-La Roche AG, Roche Pharma Research and Early Development), Valentina Martino (F. Hoffmann-La Roche AG, Roche Pharma Research and Early Development), Miguel Gonzalez-Andrades (F. Hoffmann-La Roche AG, Roche Pharma Research and Early Development), Tony Kam-Thong (F. Hoffmann-La Roche AG, Roche Pharma Research and Early Development)

Published 2026-04-08


Imagine you are trying to find the perfect recipe for a cake. You have a massive pantry with thousands of ingredients (sugar, flour, eggs, spices, etc.). Your goal is to mix them together to create a cake that tastes the absolute best.

The problem? There are trillions of possible combinations. If you tried to bake every single one to see which is best, you'd be in the kitchen for a million years. This is what computer scientists call a "combinatorial optimization problem."

Most people try to solve this by baking randomly, or by following a strict checklist. But this paper proposes a clever new way: Stop looking at the ingredients as a list, and start looking at them as a mathematical system.

Here is the paper's idea, broken down into simple concepts and analogies.

1. The "Super Mario" Connection

The authors noticed that finding the best group of patients (or molecules) is surprisingly similar to playing Super Mario.

In Super Mario: You have basic moves: Jump, Run Left, Run Right, Go Down. You combine them to reach a goal. Interestingly, different button sequences can be functionally the same: in some contexts, "Jump then Run" lands you in the same spot as "Run then Jump."
In Medicine/Science: You have basic rules like "Age > 65," "Blood Pressure < 120," or "Has Diabetes." You combine them with "AND" logic to find a specific group of people.

The paper argues that just like Mario's moves follow hidden mathematical laws, these medical rules follow hidden algebraic laws.

2. The "Duplicate Folder" Problem

Imagine you are organizing a massive library of books.

Rule A: "Books about Cats."
Rule B: "Books about Cats AND Books about Cats."

Mathematically, Rule A and Rule B are identical. They pick out the exact same books. But a computer looking at them as text strings sees them as different. It wastes time checking both.

The authors discovered that many different-looking rules actually produce the exact same result.

Rule 1: "Age > 50 AND Smoker = Yes"
Rule 2: "Smoker = Yes AND Age > 50"

These are the same. The paper groups all rules that produce the same result into Equivalence Classes. Think of each class as a set of "Duplicate Folders."
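The simplest duplicates, those that differ only in order or repetition, can be sketched in a few lines. This is an illustration rather than the paper's code: `canonical` is a hypothetical helper, and the rule strings are the examples above.

```python
# Hypothetical helper: a conjunctive rule is a list of atomic conditions
# joined by AND. Order and repetition do not change what it selects,
# so a frozenset is a natural canonical form.
def canonical(rule):
    return frozenset(rule)

rule_1 = ["Age > 50", "Smoker = Yes"]
rule_2 = ["Smoker = Yes", "Age > 50"]
rule_3 = ["Age > 50", "Age > 50", "Smoker = Yes"]

# All three spellings collapse to the same canonical form.
same_folder = canonical(rule_1) == canonical(rule_2) == canonical(rule_3)
distinct_forms = {canonical(r) for r in (rule_1, rule_2, rule_3)}
print(same_folder, len(distinct_forms))
```

This only catches duplicates that are visible in the text of the rule. Rules built from different conditions can still select the same people, which is why the paper's grouping is based on results rather than on spelling.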

3. The Magic Trick: The Quotient Space

This is the core of the paper. Instead of searching through the entire messy library (the huge search space), the authors suggest building a Quotient Space.

The Analogy:
Imagine you have 1,000 folders, but 900 of them are just duplicates of 100 unique folders.

The Old Way: You open every single one of the 1,000 folders to find the best document.
The New Way (Quotient Space): You realize the duplicates don't matter. You create a "Smart Filing System" that automatically groups all duplicates together and gives you one representative from each group. Now, instead of opening 1,000 folders, you only open 100.

This "Smart Filing System" is the Quotient Space. It collapses all the redundant, duplicate rules into single, unique representatives.

4. How They Did It (The "Binary Switch" Trick)

To make this work on a computer, they translated the rules into a language computers love: Binary Code (0s and 1s).

Imagine a row of light switches.
Switch 1 = "Age > 50" (On = 1, Off = 0)
Switch 2 = "Smoker" (On = 1, Off = 0)
Switch 3 = "Height > 6ft" (On = 1, Off = 0)

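When you combine two rules with "AND," the computer simply turns on every switch that is on in either rule (a bitwise OR of the two rows of switches). The authors proved that this system behaves exactly like a Boolean Hypercube: every possible on/off setting of the switches is one corner of a multi-dimensional cube.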

Because they know the math behind these switches, they can use a special type of algorithm (a "Structure-Aware Genetic Algorithm") that knows: "Hey, I don't need to check every single combination. I just need to check one from each group of duplicates."

5. The Results: Why It Matters

The team tested this on real medical data (finding specific groups of patients with dry eye disease) and synthetic data.

The Standard Approach: A standard genetic algorithm, which searches without knowing about duplicates, found the "perfect" solution (the global optimum) in only 35% to 37% of runs.
The New Approach (Quotient Space): The genetic algorithm that used the "Smart Filing System" found the "perfect" solution in 48% to 77% of runs, depending on the dataset.

The Takeaway:
By realizing that many rules are just "duplicates" in disguise, and by organizing the search space to ignore those duplicates, they made the computer smarter and faster.

Summary in One Sentence

This paper teaches us that many complex problems (like finding the best patient group or the best drug) have hidden mathematical patterns; if we organize our search to ignore "duplicate" solutions and focus only on unique ones, we can find the best answer much faster and more reliably.

It's like realizing you don't need to taste every single spoonful of soup to find the saltiest one; you just need to taste one spoonful from every distinct bowl, because some bowls are just copies of others.

1. Problem Statement

Combinatorial optimization problems (e.g., patient subgroup discovery, molecular screening, logistics) often suffer from exponential search spaces and poor convergence to global optima when treated as unstructured searches. Standard approaches fail to exploit underlying mathematical regularities, leading to:

Redundancy: Many distinct rule combinations yield functionally identical outcomes (e.g., different sequences of filters producing the same patient subset).
Inefficiency: Algorithms waste computational resources exploring equivalent solutions rather than diverse regions of the search space.
Lack of Structure: Researchers often overlook that these problems possess hidden algebraic structures (groups, monoids) that, if exposed, could drastically reduce the search space.

The core challenge is to systematically identify these algebraic structures, formalize them, and leverage them to create "quotient spaces" that collapse redundant representations while preserving optimization objectives.

2. Methodology: The General Framework

The authors propose a four-step framework to transform combinatorial problems into structured optimization tasks:

Step 1: Structural Analysis

Identify the components and operations of the real-world problem. The paper focuses on problems where solutions are formed by combining discrete atomic elements (e.g., clinical criteria or molecular filters) via logical conjunction.

Step 2: Algebraic Formalisation

Map the problem to abstract algebra concepts:

Monoid Structure: The set of composite rules $S$ formed by conjunctions of atomic rules forms a monoid $(S, \wedge, \epsilon)$, where $\wedge$ is logical AND and $\epsilon$ is the identity (empty rule).
Isomorphism to Boolean Hypercube: The authors prove an isomorphism between the monoid of rules $(S, \wedge)$ and the Boolean hypercube $(V, \vee)$ with bitwise OR.
- Key Insight: Logical AND in rules corresponds to Bitwise OR in the binary vector encoding.
- Representation: A rule is encoded as a binary vector where $1$ indicates the inclusion of an atomic rule. Combining rules becomes a simple bitwise OR operation.
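The encoding described above can be sketched as follows. The atomic rule names, `encode`, and `conjoin` are illustrative placeholders rather than the authors' implementation. The sketch assumes a rule is a set of atomic conditions and that bit `i` of an integer marks whether atom `i` is included.

```python
from functools import reduce

# Illustrative atomic rules; the names are placeholders, not from the paper.
ATOMS = ["Age > 50", "Smoker = Yes", "Height > 6ft"]
INDEX = {name: i for i, name in enumerate(ATOMS)}

def encode(rule):
    """Map a set of atomic rule names to an integer bitmask (bit i = atom i included)."""
    return reduce(lambda mask, name: mask | (1 << INDEX[name]), rule, 0)

def conjoin(rule_a, rule_b):
    """Logical AND of two conjunctive rules = union of their atomic conditions."""
    return frozenset(rule_a) | frozenset(rule_b)

a = frozenset({"Age > 50"})
b = frozenset({"Smoker = Yes", "Height > 6ft"})
empty = frozenset()

# Structure preservation: encoding the conjunction equals OR-ing the encodings.
assert encode(conjoin(a, b)) == encode(a) | encode(b)
# The empty rule maps to the all-zero vector, the identity for bitwise OR.
assert encode(empty) == 0 and encode(a) | encode(empty) == encode(a)

print(format(encode(conjoin(a, b)), "03b"))  # prints 111
```

Using plain integers as bit vectors keeps rule combination to a single `|` operation, which is the practical payoff of the isomorphism.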

Step 3: Quotient Space Construction

Define an equivalence relation to group functionally identical rules:

Equivalence Relation: Two rules $r_1$ and $r_2$ are equivalent ($r_1 \sim r_2$) if they yield the same objective function value (e.g., the same biomarker fold change).
Quotient Space ($S/\sim$): The search space is reduced from the set of all rules to the set of equivalence classes. Instead of searching every rule, the algorithm searches for the best representative from each class.
Approximate Equivalence: In practice, an $\epsilon$-tolerance (a numerical threshold, not the identity element $\epsilon$ of Step 2) is used to cluster rules with similar (not necessarily identical) performance, acknowledging clinical noise.
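A minimal sketch of this construction on toy data follows. The patient records, the atomic conditions, the mean-score objective, and the fixed-width binning used for the tolerance are all assumptions made for illustration. The summary does not specify how the tolerance is applied at this step.

```python
from collections import defaultdict
from itertools import combinations

# Toy data, invented for illustration only.
PATIENTS = [
    {"age": 70, "smoker": True, "score": 4.0},
    {"age": 60, "smoker": True, "score": 4.0},
    {"age": 40, "smoker": False, "score": 1.0},
    {"age": 30, "smoker": False, "score": 1.0},
]
ATOMS = {
    "age > 50": lambda p: p["age"] > 50,
    "smoker": lambda p: p["smoker"],
    "age > 20": lambda p: p["age"] > 20,
}

def objective(rule):
    """Mean score of patients satisfying every atomic condition (None if nobody does)."""
    selected = [p["score"] for p in PATIENTS if all(ATOMS[a](p) for a in rule)]
    return sum(selected) / len(selected) if selected else None

def quotient_space(rules, tol):
    """Group rules whose objective values fall into the same bin of width tol."""
    classes = defaultdict(list)
    for rule in rules:
        value = objective(rule)
        if value is None:  # rule selects nobody; skip it
            continue
        classes[round(value / tol)].append(rule)
    return list(classes.values())

# Enumerate all 2^3 conjunctive rules, including the empty rule.
all_rules = [frozenset(c) for k in range(len(ATOMS) + 1) for c in combinations(ATOMS, k)]
classes = quotient_space(all_rules, tol=0.1)
representatives = [min(cls, key=len) for cls in classes]  # shortest rule per class
print(len(all_rules), len(classes))  # prints 8 2
```

On this toy data, eight rules collapse into two classes, so a search would only need to evaluate two representatives. Fixed-width bins can separate two values that sit on either side of a bin boundary, so a clustering step is a more forgiving choice in practice.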

Step 4: Structure-Aware Optimisation

Design algorithms that operate on the quotient space:

Quotient-Space-Aware Genetic Algorithm (GA):
- Encoding: Chromosomes map to binary atomic rule vectors.
- Niche Preservation: Periodically (every $k$ generations), the population is clustered based on objective function similarity (using DBSCAN). The best individual from each cluster (equivalence class) is preserved as a "niche elite."
- Diversity: This prevents premature convergence by ensuring the population explores distinct functional regions of the search space rather than multiple copies of the same solution.
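The niche-preservation step might look roughly like the sketch below. It is not the authors' implementation. Instead of DBSCAN it uses a simple one-dimensional gap rule as a stand-in (sort by fitness, start a new cluster when the gap exceeds `eps`). The function name `niche_elites` and the toy fitness table are hypothetical.

```python
def niche_elites(population, fitness, eps):
    """Return the best individual from each cluster of similar fitness values.

    Clusters are formed by sorting on fitness and starting a new cluster
    whenever the gap to the previous value exceeds eps. This is a simple
    stand-in for the DBSCAN clustering mentioned in the text.
    """
    if not population:
        return []
    scored = sorted(((fitness(ind), ind) for ind in population), key=lambda t: t[0])
    clusters = [[scored[0]]]
    for prev, cur in zip(scored, scored[1:]):
        if cur[0] - prev[0] <= eps:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    # Sorted ascending, so the last element of each cluster is its best (maximisation).
    return [cluster[-1][1] for cluster in clusters]

# Toy fitness table keyed by bitmask-encoded rules (values invented for illustration).
TOY_FITNESS = {
    0b001: 4.00, 0b010: 4.02, 0b011: 4.01,  # three near-duplicates
    0b100: 2.50, 0b110: 2.52,               # two near-duplicates
    0b111: 7.00,                            # a class of its own
}
population = list(TOY_FITNESS)
elites = niche_elites(population, TOY_FITNESS.get, eps=0.05)
print([format(e, "03b") for e in elites])  # prints ['110', '010', '111']
```

In a genetic algorithm loop, these elites would be copied unchanged into the next generation every $k$ generations, so that each functional region keeps at least one survivor.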

3. Key Contributions

General Framework: A systematic methodology for discovering algebraic structures in combinatorial problems and converting them into quotient space optimization problems.
Theoretical Proof: Formal proof that conjunctive rule problems exhibit monoid structure and are isomorphic to the Boolean hypercube with bitwise OR, enabling efficient bitwise operations.
Algorithmic Innovation: Development of a Quotient-Space-Aware GA that explicitly detects equivalence classes and preserves niche elites to maintain diversity.
Empirical Validation: Extensive testing on real clinical data (patient subgroup discovery) and synthetic benchmarks.
Cross-Domain Applicability: Demonstration that the framework applies to diverse fields, including Rule-Based Molecular Screening (drug discovery) and Feature Selection.

4. Results

The framework was evaluated against standard Genetic Algorithms (GA), Bayesian Optimization (BO), and Greedy Search across real and synthetic datasets.

Global Optimum Achievement:
- Quotient-Aware GA: Achieved the global optimum in 48% to 77% of runs (depending on the dataset).
- Standard GA: Achieved the global optimum in only 35% to 37% of runs.
- BO and Greedy: Reached the global optimum far less often (< 3% of runs for BO, ~2.8% for Greedy), highlighting how these methods struggle in discrete combinatorial spaces without structural exploitation.
Performance Metrics:
- Quotient-aware GAs consistently outperformed standard GAs in mean fitness scores (e.g., 82.63 vs. 79.52 on real data without numeric features).
- The method maintained high stability and solution diversity, avoiding the "stuck in local optima" issue common in standard approaches.
Robustness: The framework remained effective whether the data contained only categorical features or a mix of categorical and numeric features.

5. Significance and Implications

Bridging Theory and Practice: The paper successfully bridges the gap between abstract algebra (often viewed as purely theoretical) and practical data science, showing that algebraic structures are inherent in real-world optimization problems.
Efficiency: By collapsing redundant representations, the method transforms intractable search spaces into manageable, mathematically principled challenges.
Interpretability: In drug discovery and clinical research, the resulting rules are interpretable (e.g., "Patient Age > 65 AND Protein X > Y"), providing actionable insights for scientists.
Scalability: The approach offers a path to solving large-scale industrial optimization problems (e.g., screening billions of molecules) by reducing the search space from factorial complexity ($O(N!)$) to exponential complexity ($O(2^N)$) via quotient spaces.
Future Directions: The authors suggest extending this to complex logical forms (AND/OR mixtures), automatic structure detection, and integration with Reinforcement Learning (drawing parallels to the "Super Mario" analogy used in the paper).
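
One way to read the factorial-versus-exponential comparison in the Scalability point is as follows. It rests on an assumption not spelled out in this summary: the unstructured search counts ordered sequences of distinct atomic rules, while the structured search counts unordered subsets. For $N$ atomic rules:

$$
\underbrace{\sum_{k=0}^{N} \frac{N!}{(N-k)!}}_{\text{ordered sequences}} = \lfloor e \, N! \rfloor \quad (N \ge 1),
\qquad
\underbrace{\sum_{k=0}^{N} \binom{N}{k}}_{\text{unordered subsets}} = 2^N .
$$

For $N = 3$ this gives $1 + 3 + 6 + 6 = 16$ sequences against $2^3 = 8$ subsets. For $N = 10$ it gives $9{,}864{,}101$ sequences against $2^{10} = 1{,}024$ subsets. Grouping subsets by objective value can only shrink the count further.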

In conclusion, the paper demonstrates that exposing and exploiting algebraic structure is a simple yet powerful route to more efficient combinatorial optimization, offering a generalizable solution for problems where redundancy and equivalence classes naturally exist.