MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
arXiv:2603.19310v1 Announce Type: new Abstract: Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning fine-tuning is correspondingly constrained. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
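The abstract's memory structure can be sketched in code. This is a minimal illustration, not the paper's implementation: the class name, node IDs, and edge labels below are all hypothetical, chosen only to mirror the described node types (query, thinking, answer) and the two edge families (structural edges along a rollout, similarity edges between related nodes).

```python
from dataclasses import dataclass, field

@dataclass
class MemoryGraph:
    """Hypothetical sketch of the experience memory described in the abstract."""
    nodes: dict = field(default_factory=dict)  # node_id -> (kind, text)
    edges: list = field(default_factory=list)  # (src, dst, edge_type)

    def add_rollout(self, query_id, query, thinking, answer):
        # Each rollout contributes three nodes: query, thinking process, answer.
        t_id, a_id = f"{query_id}:think", f"{query_id}:ans"
        self.nodes[query_id] = ("query", query)
        self.nodes[t_id] = ("thinking", thinking)
        self.nodes[a_id] = ("answer", answer)
        # Structural edges follow the rollout's generation order.
        self.edges.append((query_id, t_id, "structural"))
        self.edges.append((t_id, a_id, "structural"))
        return a_id

    def add_similarity_edge(self, a, b):
        # Similarity edges connect related nodes across rollouts.
        self.edges.append((a, b, "similarity"))

g = MemoryGraph()
a1 = g.add_rollout("q1", "2+2?", "add the numbers", "4")
a2 = g.add_rollout("q2", "1+3?", "add the numbers", "4")
g.add_similarity_edge(a1, a2)  # matching answers get a similarity link
```

In the paper's setting, similarity edges would presumably come from embedding similarity rather than exact answer matches; the stub above only shows where such edges live in the graph.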
Executive Summary
MemReward, a graph-based experience memory framework, addresses the scarcity of reward labels in reinforcement learning fine-tuning of large language models (LLMs). Generated rollouts are stored as experience memory, and a Graph Neural Network (GNN) trained on the labeled subset propagates rewards to unlabeled rollouts during online optimization. Experiments on mathematics, question answering, and code generation show strong results: with only 20% of labels, MemReward reaches 97.3% of Oracle performance on Qwen2.5-3B and 96.6% on Qwen2.5-1.5B. Performance scales smoothly with the label budget, making MemReward a promising approach for training LLMs in resource-constrained settings.
Key Points
- MemReward uses a graph-based experience memory framework to store generated rollouts.
- A GNN is trained to propagate rewards to unlabeled rollouts during online optimization.
- MemReward achieves high performance with limited labels, reaching 97.3% of Oracle performance with only 20% of labels.
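The propagation step above can be illustrated with a toy example. The paper trains a GNN on labeled nodes; the function below substitutes a much simpler iterative label-propagation scheme (averaging neighbor scores while clamping labeled nodes) purely as a stand-in to show how known rewards can spread to unlabeled rollouts over graph edges. All names and the 0.5 unknown prior are illustrative assumptions, not details from the paper.

```python
def propagate_rewards(edges, labels, n_iters=20):
    """Spread known rewards over an undirected graph.

    edges:  list of (u, v) node pairs
    labels: dict mapping labeled nodes to rewards in [0, 1]
    """
    neighbors, nodes = {}, set()
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
        nodes.update((u, v))
    # Unlabeled nodes start at an uninformative 0.5 prior.
    scores = {n: labels.get(n, 0.5) for n in nodes}
    for _ in range(n_iters):
        new = {}
        for n in nodes:
            if n in labels:
                new[n] = labels[n]  # clamp labeled nodes to their rewards
            else:
                nbrs = neighbors.get(n, [])
                new[n] = sum(scores[m] for m in nbrs) / len(nbrs) if nbrs else scores[n]
        scores = new
    return scores

# A chain of three similar rollouts; only the first has a reward label.
edges = [("r1", "r2"), ("r2", "r3")]
scores = propagate_rewards(edges, {"r1": 1.0})
```

Here the unlabeled rollouts `r2` and `r3` converge toward the labeled reward of 1.0 through the chain. A trained GNN generalizes this idea: instead of plain averaging, it learns edge-type-aware aggregation over the heterogeneous graph.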
Merits
Strength in Handling Limited Labels
MemReward effectively addresses the issue of limited reward labels in LLM training, a significant challenge in reinforcement learning.
Scalability with Label Budget
MemReward's performance scales smoothly with the label budget, making it a promising approach for resource-constrained settings.
Demerits
Dependence on GNN Performance
MemReward's effectiveness relies on the performance of the GNN, which may require significant computational resources and expertise.
Potential Overfitting to Labeled Nodes
MemReward's GNN may overfit to labeled nodes, reducing its ability to generalize to new, unlabeled tasks.
Expert Commentary
MemReward's graph-based approach to reward-label scarcity is a meaningful advance for reinforcement learning fine-tuning of LLMs. Its effectiveness, however, hinges on the quality of the learned GNN, which carries its own computational and engineering costs, and the risk of overfitting to the labeled nodes warrants further study. Nevertheless, the smooth scaling of performance with label budget, reaching 99.4% of Oracle performance at 70% labels, makes it a promising option for resource-constrained settings.
Recommendations
- Further research is needed to explore the potential of MemReward in more complex tasks and domains.
- The development of more efficient GNN architectures and training methods is essential to fully realize the potential of MemReward.
Sources
Original: arXiv - cs.LG