TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
arXiv:2604.00666v1 Announce Type: new Abstract: Diffusion language models (DLMs) offer a promising path toward low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. In practice, this advantage often fails to fully materialize because standard training does not provide explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior. We propose Trajectory-Ranked Instruction Masked Supervision (TRIMS), a simple trajectory-guided supervised fine-tuning framework that injects trajectory supervision into standard Masked Diffusion Language Model (MDLM) training with minimal overhead. Instead of relying on costly DLM-based distillation, TRIMS uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, encouraging the model to learn more effective decoding orders. Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive performance with prior distillation-based approaches at substantially lower training cost. Further analysis shows that TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.
Executive Summary
This article proposes Trajectory-Ranked Instruction Masked Supervision (TRIMS), a framework for fine-tuning diffusion language models (DLMs) to improve their decoding efficiency. TRIMS injects trajectory supervision into standard masked diffusion language model (MDLM) training, using lightweight per-token signals from an autoregressive teacher to guide a trajectory-aware masking strategy, so the model learns effective token reveal orders instead of leaving them unsupervised. Experiments on LLaDA and Dream show that TRIMS improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while matching prior distillation-based approaches at substantially lower training cost. Further analysis confirms that TRIMS yields better decoding trajectories. Its scalability and generalizability, however, remain to be explored.
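The core idea, teacher-ranked trajectory-aware masking, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes the autoregressive teacher scores each target token with a log-probability, and that positions the teacher is least confident about are masked first, so high-confidence tokens remain visible and are learned as early reveals.

```python
def trajectory_ranked_mask(teacher_logprobs, mask_ratio):
    """Illustrative sketch of trajectory-aware masking (assumed mechanism).

    teacher_logprobs: per-token log-probabilities assigned to the target
        response by an autoregressive teacher (one float per position).
    mask_ratio: fraction of positions to mask at this training step.
    Returns a boolean list, True where the token is masked.
    """
    seq_len = len(teacher_logprobs)
    n_mask = max(1, round(mask_ratio * seq_len))
    # Rank positions by teacher confidence, least confident first,
    # so the hardest tokens are hidden and predicted late in the trajectory.
    order = sorted(range(seq_len), key=lambda i: teacher_logprobs[i])
    masked = set(order[:n_mask])
    return [i in masked for i in range(seq_len)]
```

In an actual SFT loop, a mask like this would replace the uniformly random masking of standard MDLM training, biasing the loss toward trajectories the teacher finds easy to decode early.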
Key Points
- ▸ TRIMS injects trajectory supervision into standard MDLM training
- ▸ TRIMS leverages lightweight signals from an autoregressive teacher
- ▸ TRIMS improves the accuracy-parallelism trade-off and matches distillation-based approaches at lower training cost
Merits
Novel and Efficient Approach
TRIMS offers a novel and efficient approach to fine-tuning DLMs, leveraging trajectory supervision to improve decoding efficiency.
Enhanced Performance
TRIMS demonstrates improved accuracy-parallelism trade-off and competitive performance with prior distillation-based approaches at lower training costs.
Improved Decoding Trajectories
TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.
Demerits
Open Questions
The scalability of TRIMS and its generalizability beyond the evaluated models and benchmarks remain to be explored.
Expert Commentary
The proposed TRIMS framework offers a promising direction for improving the efficiency of DLMs. By replacing costly DLM-based distillation with lightweight signals from an autoregressive teacher, TRIMS makes trajectory supervision practical within standard fine-tuning. The experimental results on math and coding benchmarks support its central claim: explicit supervision over token reveal order closes the train-inference mismatch and improves the accuracy-parallelism trade-off. Open questions remain around scalability to larger models and generalizability to other task domains, but the method's simplicity and demonstrated effectiveness make it a compelling direction for efficient language model decoding.
Recommendations
- ✓ Future research should focus on scaling up TRIMS to larger datasets and exploring its generalizability to other types of language models.
- ✓ Investigating TRIMS in latency-sensitive real-world applications, such as real-time translation or text summarization, would clarify its practical impact.
Sources
Original: arXiv - cs.CL