Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning

The paper introduces FLOP, a fast score-based algorithm that combines efficient parent selection with iterative Cholesky updates to make discrete search a highly accurate and practical approach for causal structure learning in linear models.

Marcel Wienöbst, Leonard Henckel, Sebastian Weichwald

Published 2026-03-02

Imagine you are a detective trying to solve a mystery: Who caused what?

You have a pile of clues (data) about a system—maybe it's how genes interact, how stock prices move, or how a patient's symptoms relate to a disease. Your goal is to draw a map (a graph) showing the direction of influence: Did A cause B, or did B cause A?

For a long time, computer scientists have struggled with this. Some tried to solve it by turning the map into a smooth, slippery slide (continuous optimization), hoping a ball would roll to the bottom and find the answer. Others tried to check every single possible map one by one, but the number of possible maps grows super-exponentially with the number of variables, so the universe would end before they finished.

This paper introduces a new detective named FLOP (Fast Learning of Order and Parents). FLOP decides to stop sliding and start walking. It embraces "discrete search"—checking specific, distinct maps—but it does so with a set of superpowers that make it incredibly fast and accurate.

Here is how FLOP works, explained with everyday analogies:

1. The Problem: The Maze of Possibilities

Imagine you are in a giant maze with millions of paths. You want to find the exit (the true causal map).

  • Some methods try to glide over the maze on a smoothed-out landscape (continuous optimization). Sometimes they settle into a valley that looks like the lowest point but isn't (a local optimum).
  • Other methods were like checking every single path one by one. It's thorough, but it takes forever.
  • The issue: In the real world, we don't have infinite time or infinite data. We need a method that is fast and smart enough to avoid getting stuck in "local traps" (dead ends that look good but aren't the best).

2. FLOP's Superpower #1: The "Warm Start" (Don't Start from Scratch)

When FLOP tries to move a piece of the puzzle (a variable) to a new spot, old methods would throw away all their previous work and start building the neighborhood from zero. It's like moving a house and having to rebuild the foundation, walls, and roof from scratch every time you move a single brick.

FLOP's trick: It remembers what it just did. If it knows the neighbors of a house, and you just move that house one street over, FLOP only checks the one new neighbor. It reuses the old work.

  • Analogy: Imagine you are rearranging furniture in a room. Instead of measuring the whole room every time you move a chair, you just measure the tiny gap the chair left and the new spot. This saves massive amounts of time.
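The furniture-measuring idea above can be sketched as score caching. This is a minimal toy illustration, not the paper's implementation: `local_score` is a stand-in for an expensive local score (such as BIC), and the point is that after swapping two adjacent variables in the ordering, only the scores whose candidate parent sets actually changed are recomputed.

```python
# Toy sketch of warm-started order search via score caching
# (illustrative only; `local_score` is a hypothetical stand-in).

compute_calls = 0

def local_score(node, parents):
    """Stand-in for an expensive local score (e.g., BIC); counts calls."""
    global compute_calls
    compute_calls += 1
    return -len(parents)  # toy score: fewer parents is better

cache = {}

def cached_score(node, parents):
    key = (node, frozenset(parents))
    if key not in cache:
        cache[key] = local_score(node, parents)
    return cache[key]

def score_order(order):
    # Each node may use any predecessor in the order as a candidate parent.
    return sum(cached_score(v, order[:i]) for i, v in enumerate(order))

s1 = score_order([0, 1, 2, 3])   # 4 fresh score computations
s2 = score_order([0, 2, 1, 3])   # swap 1 and 2: only their scores are new
print(compute_calls)             # 6, not 8: two scores were reused
```

Note that the last node's candidate parents are the same set in both orders, so its score is served from the cache; only the two swapped positions cost anything.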

3. FLOP's Superpower #2: The "Magic Calculator" (Cholesky Updates)

To decide if a map is good, FLOP has to do some heavy math (calculating probabilities). Usually, this is like doing a massive, complex multiplication problem every single time you make a tiny change.

FLOP's trick: It uses a mathematical shortcut called a "Cholesky update."

  • Analogy: Imagine you have a recipe for a cake. If you want to make a slightly bigger cake, a normal chef recalculates the whole recipe from scratch. FLOP is like a chef who knows: "Oh, I just added one egg. I only need to adjust the flour by a tiny bit." It updates the math instantly instead of re-doing the whole calculation. This makes it 100 times faster than previous top methods.
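The "adjust the recipe, don't recook it" idea corresponds to extending a Cholesky factorization. The sketch below (an assumed illustration, not the paper's code) appends one candidate parent to a Gram matrix: instead of refactorizing the enlarged matrix from scratch, it solves a single triangular system and adds one row to the existing factor.

```python
import numpy as np

# Sketch: extend a Cholesky factor when one column is appended to the
# Gram matrix X'X, instead of refactorizing (illustrative only).

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
G = X.T @ X                         # Gram matrix of 4 candidate parents

L = np.linalg.cholesky(G[:3, :3])   # factor for the first 3 parents

# Append parent 4: one forward substitution, O(k^2), not O(k^3).
b = G[:3, 3]
w = np.linalg.solve(L, b)
d = np.sqrt(G[3, 3] - w @ w)
L_new = np.block([[L, np.zeros((3, 1))],
                  [w[None, :], np.array([[d]])]])

# The extended factor matches a full refactorization.
print(np.allclose(L_new, np.linalg.cholesky(G)))  # True
```

Because scoring a candidate parent set in a linear-Gaussian model boils down to exactly this kind of factorization, shaving the cost of each update from cubic to quadratic pays off on every single search step.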

4. FLOP's Superpower #3: The "Smart Starter" (Principled Initialization)

Many algorithms start by guessing a random order of events. Imagine trying to solve a jigsaw puzzle by dumping the pieces on the floor and hoping the first piece you pick is the corner. If the puzzle is a long line (a "path"), random guessing often fails miserably.

FLOP's trick: It looks at the data first to build a "smart guess." It groups things that are strongly related together right from the start.

  • Analogy: Instead of dumping the puzzle pieces randomly, FLOP first sorts them by color and edge shape. It builds the frame first. This prevents it from getting stuck in a bad starting position.
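One simple heuristic in this spirit (an illustration, not necessarily FLOP's exact initialization scheme): in a linear model with equal noise variances, downstream variables accumulate variance from their ancestors, so sorting variables by marginal variance yields a sensible first guess at the causal order.

```python
import numpy as np

# Illustrative "smart start" heuristic (not necessarily the paper's):
# sort variables by marginal variance to guess a causal order.

rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)      # x0 -> x1
x2 = 0.8 * x1 + rng.normal(size=n)      # x1 -> x2
data = np.column_stack([x2, x0, x1])    # columns deliberately shuffled

guess = np.argsort(data.var(axis=0))    # lowest variance first
print(guess.tolist())                   # [1, 2, 0]: recovers x0, x1, x2
```

Any such data-driven guess is only a starting point; the search still refines it, but starting near the truth avoids the worst local traps.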

5. FLOP's Superpower #4: The "Hiker with a Map" (Iterated Local Search)

Even with a smart start, you might get stuck in a small valley (a local optimum) thinking it's the highest peak.

  • Old methods would stop there and say, "This is the best I can do."
  • FLOP says, "Let's shake things up." It takes the best map it found, scrambles it slightly (like shaking a snow globe), and starts searching again. It repeats this over and over.
  • The Result: Because FLOP is so fast (thanks to Superpowers 1 & 2), it can afford to do this "shake and search" hundreds of times. It explores the whole mountain range to find the true highest peak, not just the first hill it sees.
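The "shake and search" loop above is the classic iterated-local-search pattern. Here is a generic skeleton of it (a sketch of the general technique, not FLOP's code), demonstrated on a toy multimodal objective where plain hill climbing gets stuck.

```python
import random

# Generic iterated local search: hill-climb to a local optimum,
# perturb the best solution found, and repeat (illustrative sketch).

def hill_climb(x, f, neighbors):
    while True:
        best = max(neighbors(x), key=f)
        if f(best) <= f(x):
            return x                      # local optimum reached
        x = best

def iterated_local_search(x0, f, neighbors, perturb, rounds=50, seed=0):
    rng = random.Random(seed)
    best = hill_climb(x0, f, neighbors)
    for _ in range(rounds):
        candidate = hill_climb(perturb(best, rng), f, neighbors)
        if f(candidate) > f(best):        # keep only improvements
            best = candidate
    return best

# Toy multimodal objective: local maxima at every multiple of 7,
# with quality improving toward x = 49.
f = lambda x: -(x % 7) - abs(x - 50) / 100
neighbors = lambda x: [x - 1, x + 1]
perturb = lambda x, rng: x + rng.randint(-10, 10)

result = iterated_local_search(0, f, neighbors, perturb)
```

Plain hill climbing from 0 stops immediately at the local optimum 0; the perturbation step lets the search hop between basins and climb toward better optima, which is only affordable because each individual climb is cheap.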

The Big Takeaway

For years, the field of causal discovery was told: "Discrete search is too slow; you must use continuous methods." This paper says: "No, you just needed a faster engine."

FLOP proves that if you optimize your tools (using warm starts and math shortcuts), you can go back to the reliable, logical method of checking specific maps. It finds the true cause-and-effect relationships more accurately and faster than the "slippery slide" methods, even on complex problems with hundreds of variables.

In short: FLOP is the detective who stopped trying to fly over the maze and instead learned to run through it with a flashlight, a map, and a very efficient pair of shoes.
