dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
arXiv:2603.18806v1 Announce Type: new
Abstract: Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.
Executive Summary
The article proposes Trajectory Reduction Policy Optimization (dTRPO), a novel policy optimization approach for Diffusion Large Language Models (dLLMs). dTRPO reduces the cost of trajectory probability calculation through two strategies: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens serves as an unbiased estimate of the ratio over intermediate diffusion states, and (ii) the full-trajectory probability is estimated with a single forward pass over a re-masked final state. This enables scaled-up offline policy training and yields performance gains of up to 9.6% on STEM tasks, 4.3% on coding tasks, and 3.0% on instruction-following tasks. The approach also exhibits strong training efficiency, owing to its offline, single-forward design, and improved generation efficiency through higher-quality outputs. The findings have significant implications for the development of more efficient and effective dLLMs, particularly in applications requiring high-quality outputs.
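The first strategy can be sketched in a few lines. Assuming per-token log-probabilities are available from the policy and reference models, the importance ratio at a diffusion step needs only the tokens newly unmasked at that step. The function name and its toy inputs below are hypothetical, not the authors' implementation:

```python
import math

def step_importance_ratio(new_token_positions, logp_policy, logp_ref):
    """Importance ratio computed over only the tokens newly unmasked
    at one diffusion step. Per the paper's claim (i), under reference
    policy regularization this is an unbiased estimate of the ratio
    over the full intermediate diffusion state. `logp_policy` and
    `logp_ref` map token position -> log-probability (toy inputs)."""
    lp = sum(logp_policy[i] for i in new_token_positions)
    lr = sum(logp_ref[i] for i in new_token_positions)
    return math.exp(lp - lr)
```

The saving comes from the sum ranging over the handful of newly unmasked positions rather than the whole intermediate sequence at every step.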
Key Points
- ▸ dTRPO reduces the cost of trajectory probability calculation for dLLMs
- ▸ Its two strategies are an unbiased per-step probability-ratio estimate over newly unmasked tokens (under reference policy regularization) and a single-forward-pass estimate of the full-trajectory probability via a re-masked final state
- ▸ dTRPO achieves substantial performance gains across various tasks and exhibits strong training efficiency
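The second strategy, estimating the full-trajectory probability from a single forward pass over a re-masked final state, can be illustrated with a self-contained toy. The function names and the stand-in model below are hypothetical, not the authors' code:

```python
import math

MASK = "<mask>"

def toy_forward(tokens, unigram):
    """Stand-in for one dLLM forward pass: returns, for every masked
    position, a log-probability for each vocabulary token. A real
    model would condition on the unmasked context; this toy uses a
    fixed unigram distribution so the sketch stays self-contained."""
    return {i: {tok: math.log(p) for tok, p in unigram.items()}
            for i, tok in enumerate(tokens) if tok == MASK}

def trajectory_logprob_single_pass(final_tokens, unigram):
    """Estimate the full-trajectory log-probability by re-masking the
    final state and scoring every token in one forward pass, per the
    paper's claim (ii): no replay of intermediate diffusion states."""
    remasked = [MASK] * len(final_tokens)
    scores = toy_forward(remasked, unigram)
    return sum(scores[i][tok] for i, tok in enumerate(final_tokens))
```

The point of the sketch is the cost profile: one model call regardless of how many denoising steps produced the final sequence, which is what makes scaled-up offline training tractable.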
Merits
Strength
dTRPO's offline, single-forward nature enables significant training efficiency improvements
Strength
dTRPO achieves improved generation efficiency through higher-quality outputs
Strength
dTRPO achieves substantial performance gains across various tasks
Demerits
Limitation
The approach may not generalize well to other types of language models beyond dLLMs
Limitation
The computational cost of re-masked final state estimation may be high for large models
Limitation
The approach requires careful tuning of hyperparameters to achieve optimal performance
Expert Commentary
The article presents a well-researched and carefully implemented approach to policy optimization for dLLMs. The authors' trajectory reduction strategies, an unbiased per-step probability-ratio estimate under reference policy regularization and a single-forward-pass estimate over a re-masked final state, are the key innovations that make offline policy training tractable at scale. The performance gains dTRPO achieves across instruction-following and reasoning benchmarks are impressive and demonstrate the potential of this approach. As with any research, there are limitations and areas for future work: the method is tailored to the diffusion setting and may not transfer to other types of language models, and even a single forward pass over a re-masked final state carries nontrivial cost for very large models. Nevertheless, the findings of this study are significant and have important implications for the development of more efficient and effective dLLMs.
Recommendations
- ✓ Future research should explore the application of dTRPO to other types of language models
- ✓ Careful tuning of hyperparameters is necessary to achieve optimal performance with dTRPO