Intention-Conditioned Flow Occupancy Models

Imagine you are trying to teach a robot how to do a million different chores: folding laundry, cooking dinner, fixing a leaky faucet, and playing chess.

In the past, to teach a robot a new trick, you had to sit down with it and show it exactly how to do that specific task from scratch. It was slow, expensive, and the robot would forget everything once you asked it to do something slightly different.

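This paper introduces a new method called InFOM (Intention-Conditioned Flow Occupancy Models) that takes a different route: the robot first studies a large pile of unlabeled experience, then adapts quickly to each specific job. Think of it as giving the robot a "super-intuition" before it even starts learning a specific job.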

Here is how it works, broken down into simple concepts:

1. The Problem: The Robot is Confused by Mixed Signals

Imagine you walk into a room where a hundred different people are doing different things. One is dancing, one is cooking, and one is fixing a car. If you just watch the room for a while, you see a chaotic mess of movements.

If you try to teach a robot by showing it this chaotic room, the robot gets confused. It doesn't know why the person is moving their arm. Are they waving hello? Are they reaching for a cup? Are they swatting a fly?

Most AI methods try to guess the action (the arm movement). But this paper says: "No, let's guess the intention (the goal)."

2. The Solution: The "Mind Reader" (Intention Inference)

The authors built a system that acts like a mind reader. Instead of just recording the movement, it looks at the movement and asks: "What is this person trying to achieve?"

The Analogy: Imagine you see someone walking toward a fridge, opening it, and grabbing a soda.
- Old AI: "Okay, I see a hand opening a door. I will memorize that hand motion."
- InFOM: "Ah, I see they are thirsty. I will remember the concept of 'quenching thirst'."

The system takes a massive, messy dataset of people doing all sorts of things and secretly groups them by their hidden goals (intentions). It learns that "grabbing a cup" usually means "drinking," while "grabbing a wrench" usually means "fixing."

3. The "Time Machine" (Flow Occupancy Models)

Once the robot understands the intention, it needs to know what happens next. This is where the "Flow" part comes in.

Think of a river. If you drop a leaf in the water at point A, you can predict where it will be in 10 seconds, 1 minute, or 1 hour. The water flows in a specific direction based on the current.

In the robot's world, the "current" is the intention.

If the intention is "make a sandwich," the "river" of future states flows toward the fridge, then the counter, then the toaster.
If the intention is "clean the floor," the river flows toward the broom, then the dustpan.

The paper uses a mathematical tool called Flow Matching to map out these rivers. It doesn't just predict the next step; it predicts where the robot is likely to be far into the future, given that specific intention. It's like the robot has a crystal ball that shows the likely futures for a specific goal.

4. The "Master Chef" (Generalized Policy Improvement)

Now, the robot has a library of "future rivers" for every possible intention it has ever seen.

When you give the robot a new task (e.g., "Make a smoothie"), it doesn't start from zero. It looks at its library of intentions and says:

"Okay, making a smoothie is similar to 'making a sandwich' (grabbing ingredients) and 'cleaning' (washing the blender)."

It then mixes and matches these pre-learned "rivers" to figure out the best path to the goal. It's like a chef who has practiced making 1,000 different dishes. When asked to make a new recipe, they don't panic; they just combine the techniques they already know.

Why is this a big deal?

Speed: Because the robot already understands the "flow" of different goals, it learns new tasks incredibly fast. It's like the difference between learning to drive a car from scratch vs. learning to drive a truck when you already know how to drive a car.
Robustness: If the robot gets stuck or the environment changes slightly, it can look at its "intention map" and find a new path, rather than freezing up.
Efficiency: It can learn from messy, unlabeled data (like hours of security camera footage of people doing random things) without needing a human to label every single action.

The Bottom Line

InFOM is a way to teach robots to understand why things are happening, not just what is happening. By learning the hidden "intentions" behind actions and mapping out the future paths those intentions create, the robot becomes much better at adapting and can pick up new skills quickly.

It's the difference between a robot that memorizes a script and a robot that understands the story.

1. Problem Statement

The paper addresses the challenge of applying the pre-training and fine-tuning paradigm (successful in NLP and Computer Vision) to Reinforcement Learning (RL). While large foundation models have revolutionized other fields, building them for RL remains difficult due to two core issues:

Long-term Dependencies: RL agents must reason about the long-term consequences of actions, requiring models that understand temporal dynamics over extended horizons.
Heterogeneous Data & Intentions: Offline RL datasets are often collected by multiple users performing different tasks. Current methods often treat this data as a monolithic distribution or fail to explicitly model the underlying user intentions (latent goals) that drive the behavior.

Existing approaches like World Models suffer from compounding errors in long-horizon reasoning, while standard Occupancy Models often ignore user intentions or are difficult to train. The goal is to create a foundation model that can learn a probabilistic representation of temporally distant future states conditioned on latent user intentions from unlabeled, heterogeneous offline data.
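To make "temporally distant future states" concrete, here is a minimal sketch of how a future state can be drawn from the discounted occupancy of a single recorded trajectory. It assumes the offset $h$ into the future follows a geometric distribution, $P(h) = (1-\gamma)\gamma^h$ for $h = 0, 1, 2, \dots$, so that $h = 0$ returns the current state. The function name `sample_future_state`, the toy trajectory, and the clipping at the end of the trajectory are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_future_state(trajectory, t, gamma, rng):
    """Draw one future state s_f for the state at index t.

    The offset h is geometric: P(h) = (1 - gamma) * gamma**h for h = 0, 1, 2, ...
    Offsets past the end of the trajectory are clipped to the last state,
    which is a simplification for finite trajectories.
    """
    h = rng.geometric(1.0 - gamma) - 1  # NumPy's geometric starts at 1
    idx = min(t + h, len(trajectory) - 1)
    return trajectory[idx]

rng = np.random.default_rng(0)
trajectory = np.arange(100, dtype=float).reshape(100, 1)  # toy 1-D states 0..99
samples = np.array(
    [sample_future_state(trajectory, 0, 0.9, rng) for _ in range(5000)]
)
# For this toy data the state value equals the offset, so the sample mean
# should land near gamma / (1 - gamma) = 9.
mean_offset = float(samples.mean())
```

A larger $\gamma$ pushes the sampled states further into the future, which is why an occupancy model can reason over long horizons without rolling a one-step model forward.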

2. Methodology: InFOM

The authors propose Intention-Conditioned Flow Occupancy Models (InFOM), a framework that combines variational inference, flow matching, and generalized policy improvement. After defining the problem setting, the method operates in two stages, pre-training and fine-tuning:

A. Problem Setting

Data: An unlabeled, reward-free dataset $D$ collected by a behavioral policy $\beta$ (which is a mixture of policies from different users with different intentions $z$). A small reward-labeled dataset $D_{reward}$ is used for fine-tuning.
Assumption: Consecutive transitions $(s, a)$ and $(s', a')$ share the same latent intention $z$.

B. Pre-training Phase

The goal is to learn a latent variable model that captures both temporal dynamics and intentions.

Variational Intention Inference:
- The model uses an encoder $p_\phi(z | s', a')$ to infer the latent intention $z$ from the next transition.
- It maximizes an Evidence Lower Bound (ELBO) on the likelihood of future states $s_f$ given the current $(s, a)$ and the inferred intention $z$.
- This is formulated as an information bottleneck: $(S', A') \to Z \to (S, A, S_f)$.
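The encoder step above can be sketched as follows. This is a toy illustration, assuming a diagonal Gaussian encoder with a standard normal prior and linear maps in place of neural networks; the names `encode_intention`, `W_mu`, and `W_logvar` are hypothetical, and the paper's actual architecture and prior may differ.

```python
import numpy as np

def encode_intention(next_state, next_action, W_mu, W_logvar, rng):
    """Toy linear Gaussian encoder for p(z | s', a').

    Returns a reparameterized sample z and the KL divergence to N(0, I).
    """
    x = np.concatenate([next_state, next_action])
    mu = W_mu @ x
    logvar = W_logvar @ x
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): the bottleneck penalty
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return z, kl

rng = np.random.default_rng(0)
state_dim, action_dim, latent_dim = 4, 2, 3  # assumed toy sizes
W_mu = 0.1 * rng.standard_normal((latent_dim, state_dim + action_dim))
W_logvar = 0.1 * rng.standard_normal((latent_dim, state_dim + action_dim))
z, kl = encode_intention(
    rng.standard_normal(state_dim), rng.standard_normal(action_dim),
    W_mu, W_logvar, rng,
)
```

In an ELBO of this form, the KL term limits how much information about $(s', a')$ passes through $z$, while the reconstruction term (here, the flow matching loss on $s_f$) rewards intentions that help predict the future.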
Flow Matching for Occupancy Measures:
- Instead of standard likelihood maximization, the authors use Flow Matching (based on Ordinary Differential Equations) to model the discounted state occupancy measure $p_\gamma(s_f | s, a, z)$ .
- They employ a Temporal Difference (TD) Flow objective (specifically a SARSA variant). This incorporates the Bellman equation into the flow matching loss, allowing the model to stitch together trajectory segments and perform dynamic programming.
- The loss function combines a "current flow" term (reconstructing the current state) and a "future flow" term (bootstrapping from the next state-action pair).
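The two-term loss described above can be sketched in a simplified form. The sketch assumes a straight-line interpolation path between noise and target, a future target obtained by integrating a vector field from noise with Euler steps, and a $(1-\gamma)$ / $\gamma$ weighting that mirrors the Bellman equation for occupancy measures. All names (`cfm_loss`, `euler_sample`, `td_flow_loss`, `toy_field`) are hypothetical, the toy field stands in for a learned network, and the paper's exact SARSA flow loss may differ in how the future term is regressed.

```python
import numpy as np

def cfm_loss(vector_field, noise, target, t, cond):
    """Conditional flow matching loss on the straight-line path noise -> target."""
    x_t = (1.0 - t) * noise + t * target   # point on the interpolation path
    velocity = target - noise              # constant velocity along that path
    pred = vector_field(t, x_t, cond)
    return np.mean(np.sum((pred - velocity) ** 2, axis=-1))

def euler_sample(vector_field, noise, cond, steps=10):
    """Generate samples by integrating dx/dt = v(t, x, cond) from t=0 to t=1."""
    x = noise
    for k in range(steps):
        t = np.full((x.shape[0], 1), k / steps)
        x = x + vector_field(t, x, cond) / steps
    return x

def td_flow_loss(vector_field, noise, state, bootstrapped_future, t, cond, gamma):
    """(1 - gamma) * current-flow term + gamma * future-flow term."""
    current = cfm_loss(vector_field, noise, state, t, cond)
    future = cfm_loss(vector_field, noise, bootstrapped_future, t, cond)
    return (1.0 - gamma) * current + gamma * future

def toy_field(t, x, cond):
    """Stand-in for a learned network: pulls x toward the conditioning vector."""
    return cond - x

rng = np.random.default_rng(0)
batch, dim = 8, 2
state = rng.standard_normal((batch, dim))
cond = rng.standard_normal((batch, dim))       # stands in for (s, a, z)
next_cond = rng.standard_normal((batch, dim))  # stands in for (s', a', z)
noise = rng.standard_normal((batch, dim))
t = rng.uniform(size=(batch, 1))
# Bootstrapping: sample a future state conditioned on the next state-action pair.
bootstrapped = euler_sample(toy_field, rng.standard_normal((batch, dim)), next_cond)
loss = td_flow_loss(toy_field, noise, state, bootstrapped, t, cond, gamma=0.99)
```

In practice the bootstrapped samples would come from a frozen target copy of the model, as is common in TD learning; here the same toy field is reused only to keep the sketch short.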

C. Fine-tuning Phase

Once the occupancy model is pre-trained, it is adapted to specific downstream tasks with rewards.

Generative Value Estimation:
- For a downstream task, the model samples future states $s_f$ from the pre-trained flow occupancy model conditioned on sampled intentions $z \sim p(z)$.
- The Q-value is estimated via Monte Carlo: $Q_z(s, a) \approx \frac{1}{1-\gamma} \mathbb{E}[r(s_f)]$.
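The Monte Carlo estimate above amounts to averaging rewards over sampled future states and rescaling by $1/(1-\gamma)$. A minimal sketch, assuming a hypothetical sampler `sample_future_states(state, action, z, num_samples)` that wraps the pre-trained occupancy model and a reward function that scores a batch of states:

```python
import numpy as np

def estimate_q(sample_future_states, reward_fn, state, action, z, gamma,
               num_samples=16):
    """Q_z(s, a) ~= 1 / (1 - gamma) * mean of r(s_f) over sampled futures."""
    futures = sample_future_states(state, action, z, num_samples)
    return reward_fn(futures).mean() / (1.0 - gamma)

rng = np.random.default_rng(0)

def toy_sampler(state, action, z, num_samples):
    """Stand-in for the occupancy model: noisy copies of the current state."""
    return state + rng.standard_normal((num_samples, state.shape[0]))

def constant_reward(states):
    """Reward of 1 everywhere, used as a sanity check."""
    return np.ones(states.shape[0])

# With a reward of 1 at every step, the discounted return is 1 / (1 - gamma),
# so the estimate should be approximately 100 for gamma = 0.99.
q = estimate_q(toy_sampler, constant_reward,
               np.zeros(3), np.zeros(2), np.zeros(4), gamma=0.99)
```

The constant-reward case is a useful check because the answer is known in closed form regardless of what the sampler returns.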
Implicit Generalized Policy Improvement (Implicit GPI):
- Standard GPI requires maximizing over a finite set of intentions, which is unstable and prone to local optima when the intention space is continuous.
- Innovation: InFOM replaces the discrete "max" over intentions with an upper expectile loss. A scalar critic $Q(s, a)$ is trained to distill the distribution of intention-conditioned Q-values ($Q_z$).
- This acts as a "soft" maximization over the infinite space of latent intentions, providing a more robust and stable policy update.
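The "soft" maximization can be illustrated with the expectile loss itself. The sketch below uses the standard asymmetric squared error with an assumed level `tau = 0.9`; the toy Q-values and the grid search are for illustration only, and the paper's choice of expectile level is not stated here.

```python
import numpy as np

def expectile_loss(q_scalar, q_z, tau=0.9):
    """Asymmetric squared error: tau > 0.5 penalizes underestimating q_z more."""
    diff = q_z - q_scalar
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)

# Toy intention-conditioned Q-values for one (s, a) pair.
q_z = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
grid = np.linspace(0.0, 10.0, 1001)

# Find the scalar that minimizes the loss for two expectile levels.
best_upper = grid[np.argmin([expectile_loss(q, q_z, tau=0.9) for q in grid])]
best_mean = grid[np.argmin([expectile_loss(q, q_z, tau=0.5) for q in grid])]
```

With `tau = 0.5` the minimizer is the ordinary mean of the Q-values; with `tau = 0.9` it moves toward the maximum without reaching it, which is the relaxed maximization the text describes.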
Policy Extraction:
- The final policy is optimized to maximize the distilled $Q(s, a)$ with a behavioral cloning regularization term to prevent out-of-distribution (OOD) action errors.
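One common way to write such an objective is a Q-maximization term plus a squared-error penalty toward dataset actions. The sketch below assumes that form; the name `policy_loss`, the weight `bc_weight`, and the toy critic are hypothetical, and the paper's exact regularizer may differ.

```python
import numpy as np

def policy_loss(q_fn, policy_actions, dataset_actions, states, bc_weight=1.0):
    """Maximize Q while staying close to actions seen in the dataset."""
    q_term = -np.mean(q_fn(states, policy_actions))
    bc_term = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=-1))
    return q_term + bc_weight * bc_term

def toy_q(states, actions):
    """Stand-in critic that prefers small actions."""
    return -np.sum(actions ** 2, axis=-1)

rng = np.random.default_rng(0)
states = rng.standard_normal((16, 3))
dataset_actions = rng.standard_normal((16, 2))

# Actions identical to the dataset incur no behavioral cloning penalty.
loss_at_data = policy_loss(toy_q, dataset_actions, dataset_actions, states)
# Actions far from the dataset are penalized by the BC term.
loss_far = policy_loss(toy_q, dataset_actions + 5.0, dataset_actions, states)
```

The weight on the cloning term trades off exploiting the critic against staying within the data distribution, where the critic's estimates are more trustworthy.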

3. Key Contributions

Intention-Conditioned Flow Occupancy Models: A novel architecture that unifies variational intention inference with flow-based generative modeling to predict long-horizon future states conditioned on latent user goals.
Implicit Generalized Policy Improvement: A new policy extraction strategy that uses upper expectile loss to perform a relaxed maximization over a continuous space of latent intentions, avoiding the instability of discrete GPI.
SARSA Flow Matching: Adapting flow matching with a SARSA-style TD loss to learn occupancy measures, enabling efficient dynamic programming and combinatorial generalization.
Comprehensive Benchmarking: Extensive evaluation on 36 state-based and 4 image-based tasks, demonstrating significant improvements over state-of-the-art offline RL and pre-training baselines.

4. Experimental Results

The authors evaluated InFOM on ExORL (16 state-based tasks) and OGBench (20 state-based + 4 image-based tasks), comparing against 8 baselines including IQL, ReBRAC, MBPO, and various unsupervised skill learning methods.

Performance Gains:
- Returns: Achieved a 1.8× median improvement in returns compared to baselines.
- Success Rates: Increased success rates by 36% on average.
- Specific Domains: Showed large gains in challenging manipulation tasks (e.g., 20× improvement on the high-dimensional jaco domain) and visual tasks (31% improvement over the best baseline).
Ablation Studies:
- Intention Encoder: Visualizations (t-SNE) confirmed that InFOM's variational encoder successfully clusters latent intentions corresponding to distinct behaviors (e.g., "pick" vs. "place"), outperforming Hilbert and Forward-Backward representation baselines.
- Implicit GPI: The proposed implicit GPI strategy outperformed standard discrete GPI by 44% with 8× lower variance, supporting the claim that the expectile-based approach is more stable.
- Discrete vs. Continuous: Continuous latent spaces were found to be superior to discrete vector quantization, especially on complex OGBench tasks.
- Data Efficiency: InFOM demonstrated faster convergence during fine-tuning compared to methods relying on one-step transition models or self-supervised representations.

5. Significance

This work represents a significant step toward foundation models for Reinforcement Learning. By successfully modeling the joint distribution of long-term future states and latent user intentions, InFOM bridges the gap between large-scale pre-training and specific task adaptation.

Robustness to Sparse Rewards: The ability to infer intentions from unlabeled data allows the agent to explore diverse regions of the state space, mitigating the challenges of sparse reward functions that plague many offline RL algorithms.
Scalability: The use of Flow Matching (ODE-based) offers faster inference and more stable training compared to Diffusion models, making it suitable for high-dimensional control tasks.
Generalization: The framework provides a principled way to transfer knowledge across tasks with different goals but similar underlying dynamics, a crucial capability for real-world robotic deployment.

In summary, InFOM demonstrates that modeling intentions and long-horizon occupancy simultaneously via flow matching creates a powerful foundation for efficient, robust, and generalizable offline reinforcement learning.