
RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning


Shaopeng Fu, Xingxing Zhang, Li Dong, Di Wang, Furu Wei

arXiv:2604.00790v1 Announce Type: new Abstract: While large language models (LLMs) have demonstrated strong performance on complex reasoning tasks such as competitive programming (CP), existing methods predominantly focus on single-attempt settings, overlooking their capacity for iterative refinement. In this paper, we present RefineRL, a novel approach designed to unleash the self-refinement capabilities of LLMs for CP problem solving. RefineRL introduces two key innovations: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools to validate generated solutions against public test cases of CP problems. This agent always maintains a skeptical attitude towards its own outputs and thereby enforces rigorous self-refinement even when validation suggests correctness. (2) A reinforcement learning (RL) solution to incentivize LLMs to self-refine with only standard RLVR data (i.e., problems paired with their verifiable answers). Extensive experiments on Qwen3-4B and Qwen3-4B-2507 demonstrate that our method yields substantial gains: after our RL training, these compact 4B models integrated with the Skeptical-Agent not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These findings suggest that self-refinement holds considerable promise for scaling LLM reasoning, with significant potential for further advancement.
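The abstract's second innovation trains the model with only standard RLVR data, i.e., problems paired with verifiable answers. A minimal sketch of what such a binary verifiable reward could look like (the function name and whitespace normalization are assumptions, not the paper's implementation; real competitive-programming judges may also run special checkers for problems with multiple valid outputs):

```python
def rlvr_reward(candidate_output: str, verified_answer: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the candidate's output matches
    the verifiable reference answer after whitespace normalization.

    A hypothetical sketch for illustration only.
    """
    def norm(s: str) -> str:
        # Strip trailing spaces per line and surrounding blank lines,
        # the usual tolerance applied by programming-contest judges.
        return "\n".join(line.rstrip() for line in s.strip().splitlines())

    return 1.0 if norm(candidate_output) == norm(verified_answer) else 0.0
```

Because the reward is computed purely from the problem's verifiable answer, no human preference labels or learned reward model are needed.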

Executive Summary

The article presents RefineRL, an approach that unlocks the self-refinement capabilities of large language models (LLMs) for competitive programming (CP) problem solving. It pairs a Skeptical-Agent, which iteratively validates and refines solutions against public test cases using local execution tools, with a reinforcement learning recipe that incentivizes self-refinement using only standard RLVR data. After RL training, compact 4B models equipped with the Skeptical-Agent outperform much larger 32B models and approach the single-attempt performance of 235B models, suggesting considerable headroom for scaling LLM reasoning through self-refinement.

Key Points

  • RefineRL introduces a skeptical agent for iterative self-refinement in LLMs
  • The approach leverages reinforcement learning to incentivize self-refinement with standard RLVR data
  • Experiments on Qwen3-4B and Qwen3-4B-2507 show that the trained 4B models outperform 32B models and approach the single-attempt performance of 235B models
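The Skeptical-Agent's defining behavior is that it keeps interrogating its own solution even when all public tests pass. A minimal sketch of such a loop, assuming hypothetical `generate` and `critique` callables standing in for LLM calls (these names and the loop structure are illustrative, not the paper's implementation):

```python
import subprocess
import sys

def run_solution(code: str, stdin: str, timeout: float = 2.0) -> str:
    """Execute a candidate Python solution locally and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin, capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def skeptical_refine(generate, critique, problem, public_tests, max_rounds=4):
    """Iteratively refine a solution; stay skeptical even when tests pass.

    `generate(problem, ...)` and `critique(problem, solution, failures)`
    are placeholders for LLM calls; `public_tests` is a list of
    (stdin, expected_stdout) pairs.
    """
    solution = generate(problem)
    for _ in range(max_rounds):
        failures = [
            (tc, expected, got)
            for tc, expected in public_tests
            if (got := run_solution(solution, tc)) != expected.strip()
        ]
        # Skeptical stance: request a critique even when validation
        # suggests correctness, since public tests are not exhaustive.
        feedback = critique(problem, solution, failures)
        if not failures and feedback is None:
            break  # the agent has no remaining doubts
        solution = generate(problem, solution, failures, feedback)
    return solution
```

The key design choice is that passing validation alone does not terminate the loop; the agent must also run out of self-criticisms, which mirrors the "skeptical attitude" the abstract describes.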

Merits

Strength in Addressing Limitations of LLMs

RefineRL addresses the limitations of existing LLM approaches that predominantly focus on single-attempt settings, overlooking the capacity for iterative refinement.

Demerits

Potential Overreliance on Public Test Cases

The approach relies on public test cases for validation, yet these typically cover only a fraction of the hidden judge tests, so a solution that passes validation may still be incorrect, and the signal may not transfer to domains without such test cases.

Expert Commentary

While RefineRL demonstrates impressive gains in performance, the approach relies on the availability of public test cases, which may limit its applicability in real-world scenarios. Furthermore, the findings raise questions about the potential for LLMs to generalize beyond the training data. To fully realize the potential of self-refinement in LLMs, further research is needed to address these limitations.

Recommendations

  • Future research should focus on developing self-refinement techniques that do not rely on public test cases
  • Investigation into the generalizability of LLMs beyond the training data is essential

Sources

Original: arXiv - cs.AI