Imagine you are teaching a robot to perform a delicate task, like stacking cups or sliding a mouse across a table. You do this by showing it videos of a human doing the job perfectly. This is called "behavior cloning."

However, there's a catch: humans aren't perfect. Even when we try to move smoothly, our hands have tiny, involuntary jerks, pauses, and shakes. These are like "high-frequency noise" in a signal.

When a robot tries to learn from these videos, it often copies the bad habits along with the good ones. It learns to shake and jerk just like the human did. This is especially bad for a type of AI called a Diffusion Policy. Think of a diffusion policy like a sculptor who starts with a block of noisy, static-filled clay and slowly chips away the noise to reveal the statue. The problem is, if the original clay (the human data) has weird, jagged cracks in it, the sculptor might accidentally make those cracks bigger while trying to smooth things out, resulting in a jerky, unstable robot arm.

The Solution: Frequency Guidance Operator (FGO)

The authors of this paper, led by Junlin Wang, propose a new method called Frequency Guidance Operator (FGO) to fix this. Here is how it works, using some simple analogies:

1. The "Blur and Sharpen" Analogy

Imagine you have a photo of a human moving their hand.

The Problem: The photo shows the big, smooth shapes of the motion (low frequency) but also has static and grain (high-frequency noise). If you try to sharpen the whole photo at once, the grain gets amplified, making the image look worse.
The Old Way: Standard AI tries to learn the whole picture (smooth motion + jerky noise) all at once.
The FGO Way: This new method teaches the AI to look at the photo in layers. First, it looks at the big, blurry shapes (the general path of the hand). Once that path is clear, it slowly adds in the fine details. Crucially, it learns to ignore the "grain" (the noise) while adding the details.

2. The "Sub-Frequency Manifold" (The Smooth Path)

The paper talks about "sub-frequency manifolds." Imagine a mountain trail.

The Full Path: The trail has the main road, but also lots of loose rocks, potholes, and jagged edges (the noise).
The FGO Path: The AI is trained to walk on a series of smooth, paved paths that run parallel to the main trail.
- First, it walks on a very wide, smooth path that only shows the general direction (low frequency).
- Then, it moves to a slightly more detailed path.
- Finally, it moves to the full, detailed path.
By stepping through these "smooth paths" one by one, the AI learns to reach the destination without ever stepping on the jagged rocks. It effectively "filters out" the human's jerky movements before they become part of the robot's muscle memory.

3. The "Guided Sculptor"

During the robot's thinking process (called "reverse denoising"), the AI starts from pure noise and refines its guess of the next move one step at a time.

FGO acts like a guide: It whispers to the AI, "Hey, don't worry about the tiny, fast shakes right now. Focus on the big, slow movement first."
As the AI gets closer to making a decision, the guide slowly says, "Okay, now you can add a little bit of detail, but keep it smooth."
This ensures the robot's final movement is fluid and consistent, rather than a jittery copy of a human's nervous twitch.

What Did They Find?

The researchers tested this on 15 different robot tasks, ranging from simple tasks like lifting a block to complex ones like using a dexterous hand to turn a doorknob or hammer a nail. They tested these in computer simulations and on a real robot arm in a lab.

Smoother Movements: Robots using FGO moved much more smoothly. They had fewer jerks and pauses.
Better Success Rates: Because the movements were smoother and more predictable, the robots actually finished the tasks more often than robots using the old methods.
Real-World Proof: They even tested it on a real robot arm picking up cups and sliding a mouse, and it worked better than the standard methods.

The Trade-off

The paper admits one small downside: because the AI has to take these extra "smooth steps" to figure out the movement, it takes a tiny bit longer to think (a few milliseconds more) than the standard method. However, the authors argue that the gain in smoothness and success rate is worth this tiny delay.

In short: FGO teaches robots to learn from humans by focusing on the "big picture" first and filtering out the "nervous jitters," resulting in robots that move like graceful dancers rather than shaky copycats.

Technical Summary: Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

Problem Statement

Learning visuomotor policies via behavior cloning often suffers from the "pathology" of inheriting high-frequency noise present in human expert demonstrations. Natural human data inevitably contains intermittent jerks, pauses, and action jitter. When diffusion-based policies are trained to directly imitate these raw, full-frequency trajectories, they tend to overfit to these spurious high-frequency variations. This results in erratic, jerky motor commands during deployment.

This issue is particularly acute in diffusion policies because the iterative denoising process, while conceptually following a coarse-to-fine paradigm, can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. Standard diffusion models learn a direct mapping from noise to the full-frequency data manifold, a broadband objective that is exceptionally challenging for complex, nonlinear tasks where low-frequency intents and high-frequency details are temporally entangled.

Methodology: Frequency Guidance Operator (FGO)

To address these limitations, the authors propose the Frequency Guidance Operator (FGO), a novel diffusion guidance mechanism that implicitly enforces a spectral hierarchy during the generation process. The core idea is to steer the reverse denoising process through a hierarchy of intermediate sub-frequency manifolds with expanding spectral bands, rather than forcing noisy samples directly toward the full-frequency manifold.

1. Learning Multi-Band Mappings (Training Phase)

Instead of training a model to predict the full-frequency data manifold directly, FGO trains the noise predictor to learn mappings from noise to sub-frequency data manifolds.

Frequency Truncation: During training, clean action chunks $A^0_t$ are passed through a bank of discrete low-pass filters ($L_f$) defined by a cut-off frequency $f$. This produces frequency-truncated sequences $A^{0,f}_t$.
Conditional Prediction: The noise predictor $\epsilon_\theta$ is augmented to explicitly condition on the cut-off frequency $f$, taking the form $\epsilon_\theta(A^{k,f}_t, k, O_t, f)$.
Sampling Strategy: To ensure stability, the cut-off frequency $f$ is sampled such that it equals a base frequency $f_{base}$ with probability $p_{base}$, or is sampled uniformly from $[f_{base}, f_{max}]$ otherwise. This establishes a stable low-frequency baseline essential for the guided process.
k-f Coupled (KFC) Sampling: To prevent the model from wasting capacity on high-frequency predictions at high noise levels (where high-frequency signals are dominated by noise), the upper bound of the cut-off frequency $f_{max}$ is dynamically adjusted based on the diffusion step $k$. High noise levels restrict training to low frequencies, while low noise levels allow for broader spectral training.
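
The training-side steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the filter is an ideal FFT low-pass, cut-offs are integer rFFT bin indices, the k-f coupling is a simple linear rule, and the noising step is the standard DDPM forward process. The names low_pass, kfc_upper_bound, sample_cutoff, make_training_pair, f_full, and alpha_bar are all hypothetical.

```python
import numpy as np


def low_pass(actions: np.ndarray, f_cut: int) -> np.ndarray:
    """Ideal low-pass filter along time: keep rFFT bins 0..f_cut, zero the rest.

    actions has shape (horizon, action_dim).
    """
    spectrum = np.fft.rfft(actions, axis=0)
    spectrum[f_cut + 1:] = 0.0
    return np.fft.irfft(spectrum, n=actions.shape[0], axis=0)


def kfc_upper_bound(k: int, num_steps: int, f_base: int, f_full: int) -> int:
    """Assumed linear k-f coupling: f_max shrinks toward f_base as noise grows."""
    cleanliness = 1.0 - k / (num_steps - 1)  # 0 at the noisiest step, 1 at k = 0
    return f_base + int(round(cleanliness * (f_full - f_base)))


def sample_cutoff(rng, k, num_steps, f_base, f_full, p_base):
    """Return f_base with probability p_base, else uniform in [f_base, f_max(k)]."""
    if rng.random() < p_base:
        return f_base
    f_max = kfc_upper_bound(k, num_steps, f_base, f_full)
    return int(rng.integers(f_base, f_max + 1))  # upper end is exclusive


def make_training_pair(rng, clean_actions, k, alpha_bar, f):
    """Build (A^{k,f}, noise target) using a DDPM-style forward step (assumed)."""
    truncated = low_pass(clean_actions, f)  # A^{0,f}
    noise = rng.standard_normal(clean_actions.shape)
    noisy = np.sqrt(alpha_bar[k]) * truncated + np.sqrt(1.0 - alpha_bar[k]) * noise
    return noisy, noise
```

In a training loop, the pair returned by make_training_pair would be fed to the noise predictor together with k, the observation, and the sampled f, and the predictor would be regressed onto the noise target.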

2. Progressive Guidance (Inference Phase)

During the reverse denoising process, FGO steers the trajectory toward the full-frequency manifold by synthesizing a composite vector field.

Vector Field Interpolation: At each denoising step $k$, the guidance mechanism computes a weighted combination of two conditional noise estimates:
1. $\epsilon_{base}$: The vector field mapping toward the low-frequency $f_{base}$-manifold.
2. $\epsilon_{fine}$: The vector field mapping toward an intermediate $f_k$-manifold with a higher cut-off frequency.
Composite Field: The final noise estimate is $\tilde{\epsilon} = (1 - \omega_k)\epsilon_{base} + \omega_k \epsilon_{fine}$.
Progressive Expansion: As the denoising process proceeds (decreasing $k$), the cut-off frequency $f_k$ and the guidance weight $\omega_k$ are linearly scheduled to increase. This progressively drives the noisy samples from the low-frequency foundation through expanding sub-frequency manifolds until they reach the full-frequency data manifold.
Approximation: Since the clean action $A^0_t$ is unknown during inference, the frequency-truncated noisy input $A^{k,f}_t$ is approximated by applying the low-pass filter directly to the current noisy state $A^k_t$.
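
A single guided denoising step, as described above, might look like the following sketch. It assumes an ideal FFT low-pass filter, integer cut-off bins, and linear schedules in which $\omega_k$ runs from 0 at the first reverse step to 1 at $k = 0$. The summary only states that the schedules are linear and increasing, so these endpoints are assumptions. eps_model stands in for the trained predictor $\epsilon_\theta$, and guided_noise, num_steps, and f_full are hypothetical names.

```python
import numpy as np


def low_pass(actions, f_cut):
    """Ideal low-pass filter along time: keep rFFT bins 0..f_cut, zero the rest."""
    spectrum = np.fft.rfft(actions, axis=0)
    spectrum[f_cut + 1:] = 0.0
    return np.fft.irfft(spectrum, n=actions.shape[0], axis=0)


def guided_noise(eps_model, noisy_actions, k, obs, num_steps, f_base, f_full):
    """Composite estimate (1 - w_k) * eps_base + w_k * eps_fine for one step k."""
    progress = 1.0 - k / (num_steps - 1)  # 0 at the first reverse step, 1 at k = 0
    f_k = f_base + int(round(progress * (f_full - f_base)))  # assumed linear schedule
    w_k = progress  # assumed to span [0, 1]

    # A^{k,f} is approximated by filtering the current noisy state A^k.
    eps_base = eps_model(low_pass(noisy_actions, f_base), k, obs, f_base)
    eps_fine = eps_model(low_pass(noisy_actions, f_k), k, obs, f_k)
    return (1.0 - w_k) * eps_base + w_k * eps_fine
```

The returned estimate would then be passed to whatever sampler update (DDPM, DDIM, and so on) the policy already uses. Note that each step calls the predictor twice, which is consistent with the slightly higher inference latency reported for the method.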

Key Contributions

Novel Diffusion Guidance Paradigm: The paper introduces a frequency-based guidance mechanism that suppresses high-frequency noise during the denoising process by explicitly controlling the spectral bands traversed during generation.
Multi-Band Training and Inference: The method trains models on a spectrum of frequency-truncated actions and utilizes a progressive guidance strategy during inference to reconstruct actions from low-frequency structures to high-frequency details.
Comprehensive Evaluation: The authors validate FGO across 15 robotic manipulation tasks spanning 5 benchmarks (Robosuite, MimicGen, Adroit, DexArt, and a real-world xArm setup).
Ablation Studies: The paper provides detailed ablations confirming the necessity of the base frequency sampling, the KFC sampling strategy, and the linear scheduling of guidance weights.

Experimental Results

Success Rate: FGO consistently achieves superior or comparable success rates compared to baselines (DP3, DiT-Policy, and FreqPolicy). On the Robosuite and MimicGen benchmarks, FGO outperformed competitors on 3 of 4 basic tasks and both complex MimicGen tasks. On the Adroit and DexArt dexterous manipulation benchmarks, FGO surpassed baselines on 6 of 7 tasks.
Action Smoothness: FGO significantly improves temporal consistency. On the Robosuite "Can" task, FGO achieved the lowest Action Total Variation (ATV) and a particularly pronounced reduction in JerkRMS compared to all baselines, indicating smoother, less jerky execution.
Real-World Performance: In real-world experiments on an xArm manipulator (Cup and Mouse tasks), FGO consistently outperformed the baseline DP3 method, validating its robustness in physical environments.
Computational Cost: FGO introduces negligible additional training time. However, inference latency is slightly higher than baselines due to the guidance mechanism, a known trade-off for guidance-based algorithms.
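
The two smoothness metrics named above are not defined in this summary. As a rough guide to what they measure, the sketch below uses plausible finite-difference definitions, which may differ from the paper's: Action Total Variation as the mean absolute step-to-step change summed over action dimensions, and JerkRMS as the root mean square of the third time difference. The function names and the dt parameter are hypothetical.

```python
import numpy as np


def action_total_variation(actions: np.ndarray) -> float:
    """Mean over time of the summed absolute first differences (assumed definition)."""
    steps = np.abs(np.diff(actions, axis=0))  # shape (horizon - 1, action_dim)
    return float(steps.sum(axis=1).mean())


def jerk_rms(actions: np.ndarray, dt: float = 1.0) -> float:
    """RMS of the third finite difference along time (assumed definition)."""
    jerk = np.diff(actions, n=3, axis=0) / dt**3
    return float(np.sqrt(np.mean(jerk**2)))
```

Under these definitions a constant-velocity trajectory has zero jerk, while any step-to-step jitter raises JerkRMS sharply because the third difference amplifies high-frequency content.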

Significance and Claims

The paper claims that FGO addresses a fundamental limitation in behavior cloning: the tendency of diffusion policies to inherit and amplify high-frequency noise from human demonstrations. By explicitly steering the generation process through a hierarchy of sub-frequency manifolds, FGO effectively decouples the learning of global kinematic structure (low-frequency) from fine-grained details (high-frequency).

The authors assert that this approach yields policies that are not only more successful in task execution but also produce highly smooth and temporally consistent action trajectories. Unlike standard guidance methods (like Classifier-Free Guidance) which often require extrapolation weights that can destabilize generation, FGO utilizes an interpolation strategy between frequency manifolds, maintaining a stable convex combination of vector fields. The work demonstrates that leveraging frequency-domain inductive biases can significantly enhance the quality and reliability of visuomotor policies in both simulation and real-world robotic applications.
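
The contrast between extrapolation and interpolation can be made concrete. In one common convention, Classifier-Free Guidance combines an unconditional and a conditional estimate with a scale $w$ that is typically set above 1. The symbols $\epsilon_{uncond}$, $\epsilon_{cond}$, and $w$ are introduced here for illustration and do not come from the summary above.

$$
\tilde{\epsilon}_{CFG} = (1 - w)\,\epsilon_{uncond} + w\,\epsilon_{cond}, \qquad w > 1
$$

With $w > 1$ the coefficient $(1 - w)$ is negative, so the result lies outside the segment joining the two estimates. FGO keeps the same algebraic form but, assuming $\omega_k \in [0, 1]$ as the phrase "convex combination" implies, both coefficients are non-negative and sum to 1:

$$
\tilde{\epsilon} = (1 - \omega_k)\,\epsilon_{base} + \omega_k\,\epsilon_{fine}, \qquad 0 \le \omega_k \le 1
$$

By the triangle inequality this gives $\|\tilde{\epsilon}\| \le \max(\|\epsilon_{base}\|, \|\epsilon_{fine}\|)$, so the guided estimate can never be larger in norm than the larger of its two inputs, which is one way to read the stability claim.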
