MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

MemReward introduces a graph-based experience memory framework that uses a Graph Neural Network to propagate reward signals across a heterogeneous graph of LLM rollouts. This enables effective reinforcement learning fine-tuning with limited labels, achieving near-oracle performance with only 20% of the reward annotations.

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

Published 2026-03-23

The Big Problem: Learning Without a Teacher's Answer Key

Imagine you are teaching a robot (a Large Language Model) how to solve complex math problems, write code, or answer tricky questions. To teach it, you usually need a "Teacher" to look at every single answer the robot gives and say, "Good job!" or "Try again."

In the real world, this "Teacher" is often a human expert. But here's the catch:

  • It's expensive: Hiring experts to grade thousands of math proofs costs a fortune.
  • It's slow: Checking if a piece of code works perfectly takes time.
  • It's impossible for some things: For open-ended questions (like "What's the best way to save the planet?"), there is no single "correct" answer key.

Because of this, we often only have answer keys for a tiny fraction of the robot's attempts (say, 20%). The other 80% are left in the dark, and the robot can't learn from them. This is like trying to learn a new language when you only have a dictionary for 20% of the words.

The Solution: MemReward (The "Smart Study Group")

The authors of this paper created a system called MemReward. Think of it as a super-smart study group that helps the robot learn even when the teacher is busy.

Here is how it works, step-by-step:

1. The "Experience Library" (The Graph)

Usually, when a robot tries to solve a problem, it generates a "thought process" (its chain of reasoning steps) and an "answer."

  • Old way: The robot solves a problem, gets a grade, and that's it. The next problem is treated as a totally new, isolated event.
  • MemReward way: The robot stores every attempt in a giant, interconnected library.
    • Imagine a spiderweb where every node is a question.
    • If Question A is about "solving quadratic equations" and Question B is also about "quadratic equations," the web connects them with a strong thread.
    • If Question A has a "thinking process" that looks like Question B's, they are connected too.
    • This creates a Heterogeneous Graph: a map where questions, thoughts, and answers are all linked together based on how similar they are.
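To make the "spiderweb" idea concrete, here is a minimal sketch of how such a similarity graph could be built: embed each rollout as a vector, then connect pairs whose cosine similarity clears a threshold. The embeddings, node names, and the 0.8 cutoff are all illustrative assumptions, not values from the paper (which also separates question, thought, and answer nodes; we collapse them to one node per rollout for brevity).

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical toy embeddings for three rollouts. Two "quadratic" problems
# point in nearly the same direction; the geometry problem does not.
embeddings = {
    "quadratic_A": [1.0, 0.1, 0.0],
    "quadratic_B": [0.9, 0.2, 0.1],
    "geometry_C":  [0.0, 0.1, 1.0],
}

THRESHOLD = 0.8  # assumed similarity cutoff, not from the paper

# Connect every pair of rollouts whose similarity exceeds the threshold.
edges = []
names = list(embeddings)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= THRESHOLD:
            edges.append((a, b, round(sim, 3)))
```

Running this connects the two quadratic problems with a strong thread and leaves the geometry problem unlinked, which is exactly the "similar questions get connected" behavior described above.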

2. The "Smart Predictor" (The GNN)

Once the robot has this library, it trains a special AI (called a Graph Neural Network, or GNN) to look at the web.

  • The Analogy: Imagine you are in a classroom. You don't know the answer to a hard question. But you look at your neighbors.
    • If your neighbor (who is very similar to you) got the answer right, you assume you probably can too.
    • If your neighbor got it wrong, you know to be careful.
  • The GNN does exactly this. It looks at the "neighbors" (similar past questions) in the library. If the neighbors had "Good" answers, the GNN predicts that the current, ungraded answer is likely "Good" too. It propagates the reward from the known answers to the unknown ones.
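The "look at your neighbors" step can be sketched with simple label propagation, which is a hand-rolled stand-in for the paper's learned GNN: graded nodes keep their true reward, and every ungraded node repeatedly takes the average reward of its neighbors. The graph, labels, and iteration count below are all made-up toy values.

```python
# Toy adjacency list: A and B are graded rollouts, C and D are not.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
known = {"A": 1.0, "B": 1.0}  # the 'graded' rollouts (reward 1 = good)

# Start ungraded nodes at an uninformative prior of 0.5.
rewards = {n: known.get(n, 0.5) for n in graph}

for _ in range(50):  # iterate until the estimates settle
    for node in graph:
        if node in known:
            continue  # labeled nodes are clamped to their true reward
        neighbors = graph[node]
        rewards[node] = sum(rewards[m] for m in neighbors) / len(neighbors)
```

After a few sweeps, the reward from A and B flows through the web: C and D both converge to a reward near 1.0, even though no human ever graded them. A real GNN does something richer (it learns which neighbors to trust, using node features), but the propagation intuition is the same.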

3. The "Hybrid Teacher" (Online Training)

Now, the robot goes back to training.

  • For the 20% of questions with a real human teacher, it gets the Real Grade.
  • For the 80% without a human, it asks the Smart Predictor (the GNN) for a grade based on its study group.
  • The robot learns from both sources, getting a training signal on every attempt instead of just the labeled 20%, without needing more human teachers.
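The hybrid reward lookup described above boils down to a simple fallback rule. Here is a hedged sketch; `gnn_predict` is a placeholder for the trained graph predictor, and the rollout names and reward values are invented for illustration.

```python
def gnn_predict(rollout):
    # Placeholder: in the real system, a trained GNN would score this
    # rollout against its neighbors in the experience graph.
    return 0.7

def get_reward(rollout, human_grades):
    """Use the real grade when one exists, else fall back to the GNN."""
    if rollout in human_grades:
        return human_grades[rollout]   # the ~20% with true labels
    return gnn_predict(rollout)        # the ~80% without

human_grades = {"rollout_1": 1.0}      # toy labeled set
batch = ["rollout_1", "rollout_2"]
rewards = [get_reward(r, human_grades) for r in batch]
```

During RL fine-tuning, every rollout in the batch now has a reward attached, so the policy update can use all of them rather than discarding the unlabeled 80%.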

Why Is This Special? (The Results)

The paper tested this on two different sizes of robots (1.5 billion and 3 billion "brain cells", i.e., model parameters). Here is what they found:

  1. Almost Perfect with Little Help: Even with only 20% of the answers graded by humans, MemReward performed 97% as well as if 100% of the answers had been graded by humans.

    • Analogy: It's like a student who only has a textbook with 20% of the answers filled in, but by using a smart study group, they get almost the same grade as a student with the full answer key.
  2. Better Than Expected on New Stuff: When the robot faced questions it had never seen before (Out-of-Domain), MemReward actually beat the fully-supervised robot.

    • Why? Because the "study group" helped the robot realize, "Hey, this new question is just like that old one I solved correctly!" It generalized the knowledge better than just looking at isolated facts.
  3. Thinking Matters: The system didn't just look at the final answer; it looked at the thinking process (the steps the robot took).

    • Analogy: If you just look at the final answer "42," you don't know if the student guessed or calculated. But if you look at the steps they took, you can see if they understood the logic. MemReward uses these steps to make better predictions.

The Bottom Line

MemReward is a way to teach AI to learn faster and cheaper. Instead of waiting for a human to grade every single attempt, it builds a connected map of experiences. By seeing how similar past problems were solved, it can guess the quality of new answers with high accuracy.

It turns the problem of "not having enough teachers" into a game of "connecting the dots," allowing AI to reach expert levels of reasoning with a fraction of the human effort.
