Imagine you are teaching a robot to walk through a complex maze or pick up a cup of coffee. You don't want the robot to learn by bumping into walls and breaking things (that's dangerous and expensive). Instead, you give it a video library of other robots doing the job perfectly. This is called Offline Reinforcement Learning.
Recently, a new type of "brain" for robots called a Diffusion Planner became very popular. Think of a Diffusion Planner as a creative dreamer. When the robot needs to move, the planner doesn't just pick one action; it dreams up hundreds of possible future paths (like "walk left," "jump over," "slide right") and then tries to pick the best one.
The Problem: The "Dreamer" vs. The "Reality Check"
Here is the catch: The Diffusion Planner is so creative that it sometimes dreams up paths that look amazing on paper but are physically impossible.
Imagine the planner dreams up a path where the robot instantly teleports to the finish line. A "scorekeeper" (a value function) looks at this path and says, "Wow, that gets to the goal fast! That's a 10/10!" So, the robot tries to execute it. But because the robot can't actually teleport, it crashes into a wall.
In technical terms, the planner is optimizing for a high score but ignoring local feasibility (whether the very next step is actually possible given the robot's current position and physics).
The Solution: SAGE (The "Reality Check" Gatekeeper)
The authors of this paper propose a new method called SAGE (Self-supervised Action Gating with Energies).
Think of SAGE as a strict bouncer or a reality-checking editor who stands between the "Dreamer" (the planner) and the "Doer" (the robot).
Here is how SAGE works, using a simple analogy:
1. The Training Phase: Learning the "Rules of Physics"
Before the robot ever moves, SAGE studies the video library of past successes. It doesn't just memorize the moves; it learns the rhythm and flow of how things actually happen.
- The Analogy: Imagine a dance instructor watching thousands of hours of dance videos. They don't just memorize the steps; they learn that if you lift your left leg, your right arm must swing in a specific way to keep balance. If a dancer tries to lift their left leg and freeze their right arm, the instructor knows immediately: "That's not a real dance move; that's a glitch."
- The Tech: SAGE uses a special AI architecture (JEPA) to learn these "rhythms" in a hidden language (latent space). It learns to predict: "If the robot is in state A and does action B, the next state should look like C."
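To make the training idea concrete, here is a minimal toy sketch of learning a transition predictor from an offline dataset. This is an illustrative stand-in, not the paper's implementation: the real SAGE uses a JEPA-style encoder and predictor in latent space, while here a single linear model on 1-D states demonstrates the same principle of learning "state A plus action B should lead to state C."

```python
import random

# Toy stand-in for SAGE's training phase (illustrative only; the actual
# method learns a JEPA-style predictor in latent space, not a linear model).
random.seed(0)

# Offline dataset: transitions from a simple known dynamic s' = s + 0.1 * a,
# playing the role of the "video library" of past behavior.
dataset = []
for _ in range(200):
    s = random.uniform(-1, 1)
    a = random.uniform(-1, 1)
    dataset.append((s, a, s + 0.1 * a))

# Linear predictor: s_next_hat = w_s * s + w_a * a + b, fit by plain SGD.
w_s, w_a, b = 0.0, 0.0, 0.0
lr = 0.1
for epoch in range(200):
    for s, a, s_next in dataset:
        err = (w_s * s + w_a * a + b) - s_next
        w_s -= lr * err * s
        w_a -= lr * err * a
        b -= lr * err

def energy(s, a, s_next):
    """Prediction error as an 'energy': low = consistent with learned dynamics."""
    pred = w_s * s + w_a * a + b
    return (pred - s_next) ** 2

# A transition that obeys the dynamics scores near zero; a "teleport" scores high.
print(energy(0.5, 0.2, 0.52))  # consistent with the data -> near 0
print(energy(0.5, 0.2, 5.0))   # teleport -> large
```

The key point carried over from the paper: the energy is just "how surprised the learned dynamics model is," and it is trained entirely from the offline dataset, with no extra labels.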
2. The Inference Phase: The "Energy" Test
Now, the robot is in the real world. The Diffusion Planner (the Dreamer) generates a batch of candidate future paths, say 50 of them.
- The Old Way: The robot just picks the path with the highest score.
- The SAGE Way: Before the robot picks a winner, SAGE runs a quick test on the first few steps of every path.
- It asks: "If the robot tries to do the first step of this path, does it match the rhythm we learned from the videos?"
- If the path is weird (e.g., teleporting, sliding through walls), SAGE assigns it a high "Energy" score. In physics, high energy usually means something is unstable or unlikely.
- If the path looks natural and smooth, it gets a low "Energy" score.
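The energy test above can be sketched as follows. All names, the toy 1-D dynamics, and the choice of checking the first `k` steps are illustrative assumptions; the paper's version runs the check in a learned latent space.

```python
# Hedged sketch of the inference-time "Energy" test. Assume a toy 1-D world
# with dynamics s' = s + a; the lambda below stands in for the learned predictor.

def path_energy(start, actions, states, predictor, k=3):
    """Sum of squared prediction errors over the first k steps of a candidate path."""
    e, s = 0.0, start
    for a, s_next in list(zip(actions, states))[:k]:
        e += (predictor(s, a) - s_next) ** 2
        s = s_next
    return e

predictor = lambda s, a: s + a  # stand-in for the learned dynamics predictor

# Path 1: obeys the dynamics step by step -> low energy.
feasible = path_energy(0.0, [0.1, 0.1, 0.1], [0.1, 0.2, 0.3], predictor)
# Path 2: "teleports" on the second step -> high energy.
teleport = path_energy(0.0, [0.1, 0.1, 0.1], [0.1, 5.0, 5.1], predictor)
print(feasible, teleport)  # feasible is near 0, teleport is large
```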
3. The Final Decision
SAGE tells the robot: "Hey, discard the top 20% of these paths because they have high Energy (they are physically impossible). From the remaining 80% that look realistic, pick the one with the best score."
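The gating rule reads as a few lines of code. This is a sketch under stated assumptions: the function name, the exact 20% cutoff, and the representation of plans as (value score, energy) pairs are all illustrative, not taken from the paper.

```python
# Hypothetical sketch of SAGE's gate-then-select step: drop the highest-energy
# fraction of candidate plans, then pick the best-scoring survivor.

def gate_and_select(plans, discard_frac=0.2):
    """plans: list of (value_score, energy) pairs. Returns the chosen plan."""
    # Sort by energy ascending and keep the most "realistic" (1 - discard_frac).
    by_energy = sorted(plans, key=lambda p: p[1])
    keep = by_energy[: max(1, int(len(by_energy) * (1 - discard_frac)))]
    # Among the survivors, pick the plan with the highest value score.
    return max(keep, key=lambda p: p[0])

candidates = [
    (10.0, 9.5),  # best score on paper, but physically implausible (high energy)
    (7.0, 0.2),   # good score, feasible
    (6.0, 0.1),
    (3.0, 0.3),
    (5.0, 0.25),
]
print(gate_and_select(candidates))  # -> (7.0, 0.2): best score among feasible plans
```

Without the gate, a plain `max` over value scores would pick the implausible (10.0, 9.5) plan; the energy filter removes it before the value function ever gets a vote, which is exactly the "bouncer" role described above.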
Why is this a big deal?
- It's a Plug-and-Play Upgrade: You don't need to retrain the main robot brain (the Diffusion Planner). You just add this "bouncer" (SAGE) to the entrance. It works with existing systems.
- No New Training Data: SAGE learns entirely from the existing video library. It doesn't need the robot to go out and crash into things to learn what not to do.
- It Prevents "Brittle" Behavior: Without SAGE, a robot might commit to a crazy plan, fail immediately, and then have to panic and re-plan. SAGE stops the robot from even trying the crazy plans in the first place.
Summary Analogy
- The Diffusion Planner is like a screenwriter who writes 100 exciting movie scripts. Some are great, but some have plot holes (e.g., the hero flies without wings).
- The Value Function is the producer who picks the script with the most explosions and money.
- SAGE is the physics consultant. Before the producer picks the script, the consultant says, "Script #42 has great explosions, but the hero flies. That's impossible. Let's cross it out."
- The Result: The movie (the robot's action) is still exciting, but it's actually possible to film (execute).
By adding this "Reality Check," the robot becomes more reliable, less likely to crash, and better at completing long, complex tasks like walking across a room or assembling furniture.