Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning
arXiv:2603.13243v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) generate text via iterative denoising but consistently underperform on multi-step reasoning. We hypothesize this gap stems from a coordination problem: AR models build coherence token-by-token, while diffusion models must coordinate all positions simultaneously. We propose plan conditioning, a training-free method that prepends a short (~100-token) natural-language plan from an AR model to the diffusion model's prompt. The plan serves as a frozen scaffold -- globally visible context that every token position can attend to from the first denoising step. On GSM8K, plan conditioning improves LLaDA-8B-Instruct from 75.6% to 87.2% (+11.6 percentage points), matching a same-size AR model (LLaMA 3.1 8B, 87.7%) despite a 6.4pp weaker baseline. On HumanEval, the gain is +12.8pp (37.2% to 50.0%), showing plans generalize to code. The same plans improve LLaMA by only +5.7pp on GSM8K and +1.3pp on HumanEval -- diffusion models benefit 2-10x more, supporting the coordination-problem hypothesis. Across 5 random seeds, plan-conditioned GSM8K accuracy has zero standard deviation, making diffusion inference highly stable. Ablations reveal the model follows plan strategy (wrong-strategy plans cause -16.3pp) but is robust to plan values (perturbed numbers: -1.1pp), and that planner quality has a sharp threshold: smaller Llama-class plans hurt (-1.6 to -6.8pp) while frontier plans provide the full lift. Attention analysis confirms the mechanism: plan tokens receive 1.8x excess attention during early denoising, declining to uniform as completion tokens solidify. Plan conditioning costs ~$0.002 per problem and adds ~2s of latency.
Executive Summary
This article introduces 'plan conditioning,' a training-free intervention that improves diffusion large language model (dLLM) reasoning by prepending a short (~100-token) plan, generated by an autoregressive (AR) model, to the diffusion prompt. The mechanism addresses a coordination gap: AR models build coherence token-by-token, while diffusion models must coordinate all positions simultaneously. Empirical results show substantial gains for LLaDA-8B-Instruct (+11.6 percentage points on GSM8K, +12.8pp on HumanEval), while the same plans improve a same-size AR model by only +5.7pp and +1.3pp respectively, supporting the coordination hypothesis. The intervention is low-cost (≈$0.002 per problem, ~2s added latency) and highly stable: plan-conditioned GSM8K accuracy shows zero standard deviation across 5 random seeds. Ablations indicate causal impact: wrong-strategy plans degrade performance by 16.3pp, while planner quality exhibits a sharp threshold effect, with smaller Llama-class planners hurting and frontier plans providing the full lift. Attention analysis corroborates the mechanism: plan tokens receive 1.8x excess attention during early denoising, declining to uniform as completion tokens solidify.
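Since plan conditioning is purely a prompting intervention, its core can be sketched as simple prompt construction. The sketch below is illustrative, not the authors' exact template: the function name, the prompt wording, and the stand-in `plan` and `problem` strings are all assumptions; in practice the plan would come from a frontier AR model and the prompt would be passed to the dLLM's sampler.

```python
def build_plan_conditioned_prompt(problem: str, plan: str) -> str:
    """Prepend a frozen AR-generated plan to the diffusion model's prompt.

    The plan acts as globally visible context that every token position
    can attend to from the first denoising step. (Template wording is a
    hypothetical illustration, not the paper's exact format.)
    """
    return (
        "Plan (follow this strategy):\n"
        f"{plan}\n\n"
        f"Problem: {problem}\n"
        "Solution:"
    )


# Hypothetical usage: in the paper's setup, `plan` would be generated by
# an AR planner and the resulting prompt fed to the diffusion LLM.
problem = "A train travels 60 miles in 1.5 hours. What is its speed?"
plan = "1. Recall speed = distance / time.\n2. Divide 60 by 1.5 hours."
prompt = build_plan_conditioned_prompt(problem, plan)
```

Because the plan is prepended rather than interleaved, it stays frozen throughout denoising and never competes with the positions being generated.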
Key Points
- ▸ Plan conditioning improves dLLM reasoning via prepended AR-generated plan scaffold
- ▸ Significant gains observed on benchmark datasets (GSM8K +11.6pp, HumanEval +12.8pp)
- ▸ Diffusion models benefit disproportionately more than AR models, supporting coordination hypothesis
Merits
Empirical Validation
Robust improvements across multiple benchmarks with highly reproducible results (zero standard deviation in plan-conditioned GSM8K accuracy across 5 random seeds)
Demerits
Limited Applicability to AR Models
AR models gain far less from the same plans (+5.7pp on GSM8K, +1.3pp on HumanEval), suggesting the intervention is largely diffusion-specific rather than universally applicable
Expert Commentary
This work represents a sophisticated, minimalist intervention that elegantly addresses a subtle but critical bottleneck in diffusion-based reasoning. The authors avoid over-engineering by leveraging existing AR models' outputs as frozen scaffolds, a clever synergy between the two generation paradigms. The sharp threshold effect in planner quality, where frontier plans provide disproportionate gains while smaller planners actively hurt, is particularly insightful; it suggests that the quality of the planning signal matters more than its quantity, aligning with cognitive theories of scaffolding. Moreover, the stability of inference under plan conditioning (zero variance across seeds) is a significant operational advantage for deployment. While the intervention is diffusion-specific, its implications extend further: it opens the door to modular, context-aware augmentation strategies across generative architectures. The cost-benefit ratio is exemplary: negligible overhead with disproportionate performance uplift. This is a model of efficient innovation.
Recommendations
- ✓ 1. Integrate plan conditioning as a default preprocessing layer in diffusion-based LLM deployment pipelines for reasoning-heavy applications.
- ✓ 2. Extend ablation studies beyond textual reasoning (e.g., to multimodal or structured inputs) to assess the generalizability of plan conditioning.