Imagine you are playing a video game where the screen is covered in thick, shifting fog. You can see a little bit of what's in front of you, but you can't see the whole map, and the fog gets thicker or thinner randomly.
The Problem with Old AI:
Most AI agents in these situations act like a person with a very good short-term memory but no sense of confidence. They remember everything they've seen so far and mash it into a single "summary note."
- The Flaw: If the fog is thick, the AI might still make a decision based on that note, but it doesn't know how shaky that note is. It's like a detective who has a hunch but doesn't realize they are guessing wildly. It just acts, whether it's sure or not.
The New Idea: "Belief-State RWKV"
The authors of this paper propose a smarter way for the AI to think. Instead of just keeping a single "summary note," they give the AI a two-part dashboard:
- The "What I Think" Meter (the location): This is the AI's best guess about the current situation (e.g., "I think the enemy is behind that wall").
- The "How Sure Am I?" Meter (the uncertainty): This is a gauge of confidence (e.g., "I'm 90% sure" vs. "I'm basically guessing in the dark").
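To make the two-part dashboard concrete, here is a minimal sketch of a belief state as a (guess, uncertainty) pair, updated with a standard one-dimensional Bayesian (Kalman-style) rule. The class name and numbers are illustrative, not taken from the paper:

```python
class BeliefState:
    """Toy 1-D belief: a best guess (mean) plus an uncertainty gauge (variance)."""

    def __init__(self, mean=0.0, var=1.0):
        self.mean = mean   # the "What I Think" meter
        self.var = var     # the "How Sure Am I?" meter (smaller = more confident)

    def update(self, observation, obs_noise_var):
        # Standard 1-D Bayesian (Kalman-style) update: a noisy observation
        # pulls the guess toward it and shrinks the uncertainty.
        k = self.var / (self.var + obs_noise_var)  # how much to trust the observation
        self.mean += k * (observation - self.mean)
        self.var *= (1.0 - k)

belief = BeliefState(mean=0.0, var=4.0)          # starting out very unsure
belief.update(observation=2.0, obs_noise_var=1.0)  # a fairly clear reading
print(round(belief.mean, 2), round(belief.var, 2))  # → 1.6 0.8
```

Notice that noisier readings (larger `obs_noise_var`) move the guess less and shrink the uncertainty less: the meter stays honest about the fog.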
The Creative Analogy: The Weather Forecaster
Think of the old AI as a weather forecaster who just says, "It will rain tomorrow."
The new Belief-State AI says, "It will rain tomorrow, and I am 95% confident because I see dark clouds. However, if the wind shifts, my confidence drops to 40%."
Because the AI knows how unsure it is, it can change its behavior:
- High Confidence: It acts quickly and boldly.
- Low Confidence: It waits, gathers more data, or plays it safe. It doesn't just guess blindly.
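The behavior switch above can be sketched as a tiny uncertainty-aware policy. The function name and the threshold knob are made up for this illustration; the paper's actual policy is learned, not a hand-set rule:

```python
def choose_action(belief_mean, belief_var, act_threshold=0.5):
    """Toy uncertainty-aware policy: act on the guess only when confident.

    `act_threshold` is an illustrative knob, not a value from the paper.
    """
    if belief_var <= act_threshold:
        return f"act on guess {belief_mean:.1f}"  # high confidence: act boldly
    return "gather more data"                     # low confidence: wait, play it safe

print(choose_action(3.2, 0.1))  # confident → "act on guess 3.2"
print(choose_action(3.2, 2.5))  # foggy → "gather more data"
```

The same guess (3.2) leads to two different behaviors; only the confidence gauge changed.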
Why This Matters (The "RWKV" Part)
The paper uses a specific type of AI architecture called RWKV. Think of RWKV as a super-efficient, lightweight engine that can remember long stories without needing a massive computer.
- Old Way: The engine runs, but the driver (the AI) is blind to its own uncertainty.
- New Way: The engine still runs efficiently, but now the driver has a dashboard showing their confidence level. This allows the AI to make better decisions in tricky, foggy situations without needing a supercomputer.
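Why is the engine lightweight? Here is a toy fixed-size recurrence in the spirit of RWKV-style models; this is a drastic simplification for intuition, not the actual RWKV update rule. Old information fades while new input is folded in, so the memory cost stays constant no matter how long the story gets:

```python
def run_recurrent(tokens, decay=0.9):
    """Toy constant-memory recurrence (NOT the real RWKV equations):
    one scalar state summarizes the whole history so far."""
    state = 0.0
    for x in tokens:
        # Fade the old summary, mix in the new observation.
        state = decay * state + (1.0 - decay) * x
    return state
```

A Transformer, by contrast, keeps every past token around and compares against all of them at each step; the recurrent engine carries only the small running state.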
What the Experiment Showed
The researchers tested this on a simple game where the AI had to guess a hidden number while dealing with random "noise" (static on the line).
- The Result: The new AI didn't win every single easy game. In fact, on easy days, the old "summary note" AI was slightly faster.
- The Win: But when the game got hard (lots of noise/fog) or when the rules changed slightly (a "shift" in the environment), the new AI with the "Confidence Meter" performed much better. It knew when to wait and when to act, avoiding costly mistakes.
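The test bed can be pictured with a few lines of code. This is a hypothetical reconstruction of the setup (names, noise model, and numbers are illustrative, not the paper's): the agent only ever sees the hidden number through random static, and "hard mode" just means more static:

```python
import random

def noisy_game(hidden, noise_level, steps=5, seed=0):
    """Toy noisy-observation game: readings of a hidden number
    corrupted by Gaussian "static on the line"."""
    rng = random.Random(seed)
    return [hidden + rng.gauss(0.0, noise_level) for _ in range(steps)]

easy = noisy_game(hidden=7.0, noise_level=0.1)  # clear day: readings cluster near 7
hard = noisy_game(hidden=7.0, noise_level=5.0)  # heavy fog: readings scatter widely
```

On `easy`, a single reading is nearly enough, so a quick point-estimate agent does fine; on `hard`, an agent that tracks its uncertainty knows it should average more readings before committing.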
The Takeaway
The paper argues that for AI to be truly smart in uncertain worlds, it shouldn't just remember what happened; it needs to remember how sure it is about what happened. By giving the AI a simple "confidence gauge," we make it more robust, safer, and better at handling the unexpected, all while keeping the system fast and efficient.
In a nutshell: It's about teaching AI to say, "I'm not sure," and then acting accordingly, rather than pretending it knows everything.