HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Ken Ding

arXiv:2603.23871v1 Abstract: Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.

Executive Summary

This article proposes Hybrid Distillation Policy Optimization (HDPO), a new approach to training large language models for mathematical reasoning. HDPO targets 'cliff' prompts, problems on which every rollout fails and the reinforcement-learning gradient therefore vanishes, by adding privileged self-distillation: the model generates rollouts with ground-truth information in its input, keeps only the correct solutions, and distills the resulting teacher's token-level distribution back into the student. The authors prove that HDPO recovers the optimal KL-regularized RL policy in the hard-threshold limit, and their experiments show consistent gains in coverage metrics (pass@4, pass@8) while maintaining greedy accuracy. The distillation weight lambda gives direct control over the exploration-exploitation tradeoff. The method's efficacy and scalability in more complex tasks remain to be investigated.
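The post never writes the training objective down explicitly; from the description above, the hybrid loss plausibly takes roughly this shape (a reconstruction, not the authors' notation; here \(a^*\) is the ground-truth information supplied to the teacher and \(y_{<t}\) a rollout prefix):

$$
\mathcal{L}_{\text{HDPO}}
= \mathcal{L}_{\text{RL}}
+ \lambda \sum_{x \in \text{cliff}} \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, a^*)}
\left[ \sum_t D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x, a^*, y_{<t}) \,\big\|\, \pi_\theta(\cdot \mid x, y_{<t}) \big) \right]
$$

with the expectation restricted to privileged rollouts \(y\) that pass the correctness filter. Larger \(\lambda\) pushes harder toward the privileged teacher (more coverage of unsolved prompts); smaller \(\lambda\) leaves the RL term dominant. That is the exploration-exploitation dial the abstract refers to.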

Key Points

  • HDPO introduces privileged self-distillation to address the 'cliff' prompt challenge in reinforcement learning.
  • HDPO generates rollouts with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student.
  • HDPO is proven to recover the optimal KL-regularized RL policy in the hard-threshold limit.
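The pipeline in these bullets can be sketched in a few lines. Everything below is an illustrative toy, not the authors' code: logits are plain lists, `softmax` and `kl` are written out in pure Python, and the lambda weighting follows the abstract's description.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def is_cliff(rewards):
    """A prompt is a 'cliff' when every rollout failed (all rewards are 0)."""
    return not any(rewards)

def distill_loss(teacher_logits, student_logits):
    """Mean token-level KL(teacher || student) along one filtered rollout.

    teacher_logits: per-token logits scored with the privileged (hinted) input;
    student_logits: logits for the same tokens scored with the plain prompt.
    Both come from the same weights; only the input differs.
    """
    return sum(kl(softmax(t), softmax(s))
               for t, s in zip(teacher_logits, student_logits)) / len(teacher_logits)

def hdpo_loss(rl_loss, rewards, teacher_logits, student_logits, lam=0.1):
    """Hybrid objective: plain RL loss, plus lambda-weighted distillation
    on cliff prompts where the RL gradient alone would vanish."""
    if is_cliff(rewards):
        return rl_loss + lam * distill_loss(teacher_logits, student_logits)
    return rl_loss
```

On a cliff prompt the added KL term supplies the only learning signal; on any prompt with at least one successful rollout, the loss reduces to the ordinary RL term.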

Merits

Strength

HDPO addresses a fundamental challenge in training large language models for mathematical reasoning, enabling consistent improvements in coverage metrics while maintaining greedy accuracy.

Advancements

HDPO's privileged self-distillation is a novel design: because teacher and student share the same weights and differ only in their input, the realizability gap is provably bounded, an advantage over cross-model distillation.

Theoretical Foundations

HDPO is proven to recover the optimal KL-regularized RL policy in the hard-threshold limit, providing a strong theoretical foundation for the approach.
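This claim connects to a standard result. The KL-regularized RL objective \(\max_\pi \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})\) has the closed-form optimum

$$
\pi^*(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right).
$$

With a binary correctness reward, taking \(\beta \to 0\) (the hard-threshold limit) concentrates \(\pi^*\) on \(\pi_{\mathrm{ref}}\) restricted to correct solutions, which is exactly the distribution that filtered (rejection-sampled) privileged generation draws from. This is only a sketch of why the paper's R=1 result is plausible, not the paper's actual proof.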

Demerits

Limitation

HDPO's efficacy and scalability in more complex tasks and domains remain to be investigated.

Expert Commentary

HDPO is a significant contribution to reinforcement learning for language models. It tackles a fundamental failure mode of RL training for mathematical reasoning: prompts on which no rollout succeeds contribute no gradient at all. The weight-sharing self-distillation design also gives the method a clean theoretical footing, with a proof that it recovers the optimal KL-regularized policy in the hard-threshold limit. That said, the reported evaluation covers a single 1.5B-parameter model on one dataset, so efficacy and scalability in more complex tasks and domains remain open questions. If the approach transfers, its practical implications are significant, and it may also carry policy implications for the development of more advanced AI systems capable of mathematical reasoning and problem-solving.

Recommendations

  • Further investigation is needed to determine the efficacy and scalability of HDPO in more complex tasks and domains.
  • HDPO should be applied to a wider range of mathematical reasoning tasks to evaluate its performance and limitations.

Sources

Original: arXiv - cs.LG