Imagine you are playing a high-stakes game of Tetris, but instead of controlling the blocks yourself, you have a super-smart robot assistant. This assistant doesn't just pick one move; it tries to "dream" up a whole sequence of future moves, picks the best one, and then you execute the very first step of that dream.
This paper introduces DIFFTETRIS, a new way to build that robot assistant using a type of AI called a Diffusion Model. Think of a diffusion model like a sculptor who starts with a block of noisy, random clay and slowly chips away the noise to reveal a perfect statue. In this case, the "statue" is a perfect sequence of Tetris moves.
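The "dream a sequence, execute only the first step" loop described above is classic receding-horizon planning. Here is a minimal sketch of that loop, using a random stand-in sampler and a toy critic in place of the paper's actual diffusion model and scorer (all function names and constants here are hypothetical, for illustration only):

```python
import random

H = 4              # planning horizon: how many future moves to "dream"
N_CANDIDATES = 16  # how many full plans to sample per decision

def sample_plan(state, horizon):
    """Stand-in for the diffusion sampler: a real model would denoise a
    noisy action sequence; here we just pick random drop columns 0-9."""
    return [random.randrange(10) for _ in range(horizon)]

def score_plan(state, plan):
    """Stand-in critic (heuristic or learned): higher is better."""
    return -sum(plan)  # toy objective for illustration only

def plan_next_move(state):
    """Dream N candidate sequences, keep the best one, and return only
    its first step; after executing it, we replan from the new state."""
    candidates = [sample_plan(state, H) for _ in range(N_CANDIDATES)]
    best = max(candidates, key=lambda p: score_plan(state, p))
    return best[0]  # receding horizon: the rest of the "dream" is discarded
```

The key design point is that the rest of each dreamed sequence is thrown away every turn: planning far ahead only guides the choice of the immediate move.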
Here is the story of what the researchers found, broken down into simple analogies:
1. The "Impossible Move" Problem (Feasibility Constraints)
The Analogy: Imagine the robot is trying to plan a road trip. In the "unconstrained" version, the robot might suggest driving straight through a mountain or flying over a canyon because it's just guessing randomly. In Tetris, this is like the robot suggesting to drop a block into a spot where it physically doesn't fit. If you try to make that move, the game crashes immediately.
The Fix: The researchers added a "Feasibility Mask."
Think of this as a traffic cop standing next to the robot. Before the robot can even suggest a move, the traffic cop checks the board. If a move is illegal (like trying to fit a square peg in a round hole), the cop slaps a "STOP" sign on it.
- The Result: Without the traffic cop, the robot wasted 46% of its time suggesting impossible moves. With the cop, the robot only suggests legal moves. This single change made the robot 6.8 times better at surviving and scoring. It turned a chaotic mess into a focused search for the right moves.
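The "traffic cop" can be as simple as a boolean mask over actions, applied before scoring so that infeasible moves are never considered. A toy sketch, assuming moves are just drop columns and feasibility is a board-width check (a real Tetris mask would also test rotations and collisions with the existing stack):

```python
def feasibility_mask(piece_width, board_width=10):
    """True for each drop column where the piece stays on the board.
    A real mask would also check collisions against the current stack."""
    return [col + piece_width <= board_width for col in range(board_width)]

def filter_candidates(candidate_moves, mask):
    """The 'traffic cop': discard any sampled move the mask forbids,
    so downstream scoring never wastes compute on impossible moves."""
    return [col for col in candidate_moves if mask[col]]
```

For example, a 4-wide piece on a 10-wide board can only be dropped in columns 0 through 6; the mask rules out columns 7-9 before the planner ever sees them.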
2. The "Bad Coach" Problem (Critic Alignment)
The Analogy: After the robot dreams up 64 different road trips, it needs a coach to pick the best one.
- The Heuristic Coach: This is an old-school expert who knows the rules of Tetris perfectly. They look at the board and say, "Don't leave holes! Keep it flat!"
- The DQN Coach: This is a "learned" coach (an AI trained by playing the game itself). You'd think a trained AI would be better, right?
The Surprise: The researchers found that the DQN Coach was actually terrible.
Even though the DQN coach had played the game before, it was systematically picking the worst road trips. It was like a coach who loves the color blue and keeps picking blue cars, even if the blue cars have no engines.
- The Metric: They measured "Regret." This is the difference between the score you could have gotten with the best move and the score you actually got with the coach's choice. The DQN coach had huge regret—it was actively hurting the player.
- The Lesson: Just because an AI is "trained" doesn't mean it understands the specific task of planning ahead. It was good at reacting to the current moment but bad at judging a whole sequence of future moves.
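Regret, as used here, is easy to state: among the candidate plans, compare the return of the plan the coach picks against the best return available. A minimal sketch (the helper names are hypothetical, and the paper's exact scoring details may differ):

```python
def regret(true_returns, critic_scores):
    """Difference between the best available return and the return of
    the plan the critic ranks highest. Zero means a perfect coach."""
    picked = max(range(len(critic_scores)), key=lambda i: critic_scores[i])
    return max(true_returns) - true_returns[picked]

# A coach that ranks the worst plan highest incurs large regret:
bad_coach = regret([10, 50, 30], [0.9, 0.1, 0.5])   # picks plan 0
good_coach = regret([10, 50, 30], [0.1, 0.9, 0.5])  # picks plan 1
```

Here `bad_coach` is 40 (it chose the 10-point plan when 50 was available) while `good_coach` is 0, which is the pattern the researchers observed: the DQN critic's rankings were systematically misaligned with the plans' true returns.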
3. The "Crystal Ball" Problem (Horizon Effects)
The Analogy: Imagine you are planning your day.
- Short Horizon (H=4): You plan the next 4 hours. You know exactly what's happening.
- Long Horizon (H=8): You try to plan the next 8 hours. But the further out you look, the more foggy your crystal ball becomes. You start guessing about things you don't know yet (like what the next Tetris piece will be).
The Finding: You might expect the robot to perform better when it looked further into the future (H=8) than when it looked only a little ahead (H=4).
Actually, no! The robot did worse with the longer plan.

Because the robot's "dreaming" process gets fuzzier the further out it goes, trying to plan 8 steps ahead introduced too much confusion and error. It was like trying to solve a math problem by guessing the answer to the last step first; the errors piled up.
- The Winner: The robot was fastest and most accurate when it planned only 4 steps ahead. It was a case of "less is more."
4. The "More Eyes" Problem (Compute Scaling)
The Analogy: Imagine you are trying to find a needle in a haystack.
- Option A: You have 16 friends looking for the needle.
- Option B: You have 64 friends looking for the needle.
The Finding: The more friends (candidates) you have, the better the result.
If you give the robot more time to generate more "dreams" (candidates) to choose from, it finds better moves. However, this takes more computer power and time.
- The Trade-off: If you want the absolute best score, you need 64 candidates. If you want a fast game, 16 candidates are "good enough" and much quicker.
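The "more eyes" effect is just best-of-N sampling: each candidate is an independent draw, so the expected quality of the best draw rises with N, while sampling compute grows linearly in N. A toy sketch with a uniform-random stand-in sampler rather than a real diffusion model (names are illustrative):

```python
import random

def best_of_n(n, sampler, scorer):
    """Sample n candidates and keep the one the scorer likes most."""
    return max((sampler() for _ in range(n)), key=scorer)

def sampler():
    """Stand-in for one diffusion 'dream': a random score in [0, 1)."""
    return random.random()

def scorer(plan):
    return plan  # stand-in critic: the sample's own value

random.seed(0)
# Averaged over many decisions: more candidates -> better best score,
# but n times the sampling compute.
avg_16 = sum(best_of_n(16, sampler, scorer) for _ in range(200)) / 200
avg_64 = sum(best_of_n(64, sampler, scorer) for _ in range(200)) / 200
```

With uniform samples the expected best of n draws is n/(n+1), so going from 16 to 64 candidates buys a small but real improvement, which mirrors the paper's trade-off: 64 candidates for the best score, 16 when speed matters.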
The Grand Conclusion
The paper teaches us three big lessons for building AI that plays games or makes decisions:
- Don't let the AI break the rules: You must force the AI to only consider legal moves (Feasibility Masking). Without this, the AI is just guessing in the dark.
- Be careful with "Smart" Coaches: An AI trained to play the game isn't necessarily good at planning the game. Sometimes, simple, human-made rules (heuristics) are better at judging a sequence of moves than a complex neural network.
- Short-term planning is often better: In complex games with random elements, trying to predict too far into the future creates more confusion than clarity. Sometimes, it's better to plan just a few steps ahead and do it very well.
In short, DIFFTETRIS works best when it is forced to play by the rules, guided by simple wisdom rather than a confused "smart" coach, and when it focuses on the immediate future rather than a foggy, distant one.