Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao

arXiv:2603.11495v1 Announce Type: new Abstract: Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.

Executive Summary

The article 'Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs' proposes Tool-DC, a framework for improving the performance of Large Language Models (LLMs) on long-context tool-calling tasks, where models must select among massive, noisy candidate tools. The framework employs a 'Try-Check-Retry' paradigm that reduces reasoning difficulty and leverages the self-reflection ability of LLMs. Two variants are presented: a training-free version, Tool-DC (TF), which is plug-and-play, and a training-based version, Tool-DC (TB), which is more inference-efficient. Experiments on the BFCL and ACEBench benchmarks show clear improvements over baselines: Tool-DC (TF) achieves up to +25.10% average gains, while Tool-DC (TB) enables Qwen2.5-7B to match or exceed proprietary LLMs such as OpenAI o3 and Claude-Haiku-4.5. This research could broaden the practical, real-world applicability of LLM tool-calling.

Key Points

  • Tool-DC is a novel framework for boosting LLM performance in long-context tool-calling tasks
  • The framework employs a 'Try-Check-Retry' paradigm to reduce reasoning difficulty and leverage self-reflection
  • Two variants of Tool-DC are presented: training-free (TF) and training-based (TB)
  • Experimental results demonstrate significant improvements over baseline models
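The 'Try-Check-Retry' loop over a divided tool catalogue can be illustrated with a minimal sketch. Note that the paper's actual Tool-DC algorithm is not detailed in this summary; the chunking strategy and the `select_tool` / `check_call` callables below are illustrative assumptions, standing in for an LLM's tool-selection attempt and its self-reflection check.

```python
def chunk(tools, size):
    """Divide a massive, noisy candidate-tool list into small sub-lists
    (the 'divide' step), so each selection attempt reasons over few tools."""
    return [tools[i:i + size] for i in range(0, len(tools), size)]


def try_check_retry(query, tools, select_tool, check_call,
                    chunk_size=4, max_retries=2):
    """Hypothetical Try-Check-Retry loop ('conquer' step):
    for each chunk, try a tool selection, check it via self-reflection,
    and retry within the chunk on failure before moving on."""
    for sub_tools in chunk(tools, chunk_size):
        for _ in range(max_retries + 1):
            candidate = select_tool(query, sub_tools)   # Try
            if candidate is None:
                break                                   # nothing plausible here
            if check_call(query, candidate):            # Check
                return candidate
            # otherwise Retry within the same chunk
    return None
```

With toy keyword-matching stand-ins for the model calls, a query like "weather in Paris" against a catalogue containing `get_weather` resolves to that tool, while an unmatched query falls through every chunk and returns `None`.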

Merits

Strength

By dividing massive, noisy tool catalogues into smaller sub-problems, the framework reduces reasoning difficulty, and its check-and-retry loop exploits the self-reflection ability of LLMs rather than relying on a single selection attempt.

Demerits

Limitation

The framework's reliance on the quality of pre-trained LLMs may limit its generalizability to other tasks and domains.

Expert Commentary

The article presents a significant contribution to LLM research, particularly for long-context tool-calling tasks. The proposed framework, Tool-DC, is well designed and effectively leverages the self-reflection abilities of LLMs to improve performance, and the experimental results on BFCL and ACEBench are convincing. However, the framework's reliance on the quality of the underlying pre-trained LLMs may limit its generalizability to other tasks and domains; future research should address this limitation and explore the framework's applicability to other areas of LLM research.

Recommendations

  • Further research should focus on addressing the framework's reliance on pre-trained LLMs and exploring its applicability to other tasks and domains.
  • The proposed framework should be evaluated on a broader range of benchmarks and tasks to demonstrate its generalizability and robustness.
