DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
arXiv:2603.11076v1 Abstract: Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection--Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4x less data.
Executive Summary
The article proposes DIVE, an approach to scaling diversity in agentic task synthesis for generalizable tool use in large language models. By inverting the synthesis order, executing diverse real-world tools first and deriving tasks from the resulting traces, DIVE provides grounding by construction and scales structural diversity along two controllable axes: tool-pool coverage and per-task toolset variety. Training Qwen3-8B on DIVE data yields a +22-point average gain across nine out-of-distribution benchmarks and outperforms the strongest 8B baseline by 68 points. The study also finds that diversity scaling consistently beats quantity scaling for out-of-distribution generalization, even with 4x less data.
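As an illustrative sketch only (not the authors' implementation; all function and variable names here are hypothetical), the inverted synthesis order described above, execute tools first, then derive a task strictly entailed by the resulting trace, could look like this:

```python
import random

def run_tool(tool, arg):
    """Stand-in for executing a real tool; records the call and its result."""
    return {"tool": tool, "arg": arg, "result": f"{tool}({arg})"}

def collect_evidence(toolset, steps=3):
    """Evidence Collection: run a short multi-step tool sequence, chaining outputs."""
    trace, arg = [], "seed"
    for _ in range(steps):
        record = run_tool(random.choice(toolset), arg)
        trace.append(record)
        arg = record["result"]  # feed each result into the next step
    return trace

def derive_task(trace):
    """Task Derivation: phrase a task whose answer is fixed by the executed trace,
    so the synthesized example is executable and verifiable by construction."""
    tools_used = [r["tool"] for r in trace]
    return {
        "question": f"Using {', '.join(tools_used)}, what is the final result?",
        "answer": trace[-1]["result"],  # ground truth comes from actual execution
        "toolset": sorted(set(tools_used)),
    }

tool_pool = ["search", "calculator", "weather", "calendar", "translator"]
toolset = random.sample(tool_pool, k=3)  # one sampled per-task toolset
task = derive_task(collect_evidence(toolset))
```

The key property this sketch tries to capture is that the answer is read off an executed trace rather than invented first, so every derived task is grounded and checkable.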
Key Points
- ▸ DIVE inverts the synthesis order to execute diverse real-world tools first
- ▸ The approach scales structural diversity along two controllable axes
- ▸ DIVE improves out-of-distribution generalization by 22 average points across 9 benchmarks
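To make the two diversity axes concrete, here is a minimal hypothetical sketch (names and metrics are assumptions for illustration, not DIVE's code) of sampling toolsets while tracking tool-pool coverage and per-task toolset variety:

```python
import random

def sample_toolsets(tool_pool, n_tasks, min_k=2, max_k=4, seed=0):
    """Sample one toolset per task, varying its size to diversify combinations."""
    rng = random.Random(seed)
    toolsets = []
    for _ in range(n_tasks):
        k = rng.randint(min_k, max_k)
        toolsets.append(frozenset(rng.sample(tool_pool, k)))
    return toolsets

def diversity_metrics(toolsets, tool_pool):
    """Measure both axes over a batch of sampled toolsets."""
    covered = set().union(*toolsets)
    coverage = len(covered) / len(tool_pool)      # axis 1: tool-pool coverage
    variety = len(set(toolsets)) / len(toolsets)  # axis 2: unique toolset combinations
    return coverage, variety

pool = [f"tool_{i}" for i in range(20)]
batch = sample_toolsets(pool, n_tasks=50)
coverage, variety = diversity_metrics(batch, pool)
```

Under this framing, scaling diversity means pushing both metrics up, touching more of the pool and repeating fewer toolset combinations, rather than simply generating more tasks.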
Merits
Effective Diversity Scaling
DIVE's diversity scaling yields large out-of-distribution gains: a +22-point average improvement across nine benchmarks, with diversity scaling outperforming quantity scaling even at 4x less data
Grounding by Construction
Because tasks are reverse-derived from executed tool traces, they are executable and verifiable by construction
Demerits
Limited Tool Coverage
The study only covers 373 tools in five domains, which may not be representative of all possible tool types and use cases
Computational Complexity
Executing real-world tools and reverse-deriving tasks at scale may require significant computational resources
Expert Commentary
The article presents a notable contribution to tool-using language models. DIVE addresses the challenge of scaling diversity in agentic task synthesis, which the authors identify as the key bottleneck for robust generalization, and its evidence-first design sidesteps the grounding problems of task-first synthesis. The reported results support the approach's effectiveness, and the finding that diversity scaling beats quantity scaling has practical implications for data curation in agent post-training. Further research is needed, however, to address the limited tool coverage and computational cost, and to test the recipe in additional domains.
Recommendations
- ✓ Future studies should investigate the applicability of the DIVE approach to other domains and tasks
- ✓ The development of more efficient and scalable methods for executing and reverse-deriving tasks is necessary to reduce computational complexity