
RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models


Sai Hao, Hao Zeng, Hongxin Wei, Bingyi Jing

arXiv:2603.06616v1 Announce Type: new Abstract: Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the cost-performance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to misrouting. In this work, we formulate LLM routing as the $\alpha$-VOR problem to minimize expected set size while controlling the misrouting risk, and propose a novel method -- RACER, extending base routers to output model sets that can be subsequently aggregated for improved output. In particular, RACER constructs nested model sets via augmented scoring and utilizes finite-sample concentration bounds to calibrate a threshold that allows for both variable set sizes and abstention. We theoretically prove that RACER achieves rigorous distribution-free risk control on unseen test data in a post-hoc and model-agnostic manner. Extensive experiments verify our theoretical guarantees and demonstrate that RACER consistently enhances downstream accuracy across a wide range of benchmarks.

Executive Summary

The article introduces RACER, a method for routing queries to large language models (LLMs) that minimizes the expected size of the selected model set while keeping the misrouting risk below a user-specified level. Rather than committing to a single model, RACER constructs nested model sets via augmented scoring and uses finite-sample concentration bounds to calibrate a threshold, allowing both variable set sizes and abstention; the selected models' outputs can then be aggregated. Theoretical guarantees and extensive experiments demonstrate RACER's effectiveness in enhancing downstream accuracy across various benchmarks.

Key Points

  • Formulation of LLM routing as the α-VOR problem: minimize expected model-set size subject to a misrouting-risk constraint
  • Introduction of RACER, which extends base routers to output calibrated model sets (with abstention) rather than a single model
  • Theoretical proof that RACER achieves distribution-free risk control on unseen test data, in a post-hoc and model-agnostic manner
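
The set-then-threshold idea behind these points can be sketched in code. The sketch below is an illustrative split-conformal-style calibration with a Hoeffding finite-sample correction over hypothetical router scores; it is not the paper's exact RACER procedure, and the function names (`calibrate_threshold`, `route`) are our own.

```python
import numpy as np

def calibrate_threshold(cal_scores, cal_correct, alpha=0.1, delta=0.05):
    """Choose the largest score threshold whose empirical misrouting risk,
    plus a Hoeffding-style finite-sample slack, stays below alpha.

    cal_scores:  (n, m) array of router scores for n queries over m models.
    cal_correct: (n, m) boolean array; True if model j answers query i well.
    """
    n = cal_scores.shape[0]
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))  # Hoeffding concentration term
    for t in np.unique(cal_scores)[::-1]:             # candidate thresholds, high to low
        sets = cal_scores >= t                        # nested sets grow as t drops
        miss = ~np.any(sets & cal_correct, axis=1)    # set contains no adequate model
        if miss.mean() + slack <= alpha:
            return t
    return -np.inf  # no threshold satisfies the bound: select every model

def route(scores, t):
    """Return indices of the selected model set, or None to abstain."""
    chosen = np.nonzero(scores >= t)[0]
    return chosen if chosen.size else None
```

At test time, every model whose score clears the calibrated threshold joins the set; an empty set is an abstention, and the selected models' outputs can then be aggregated as the article describes.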

Merits

Improved Accuracy

RACER consistently enhances downstream accuracy across a wide range of benchmarks.

Risk Control

RACER achieves rigorous distribution-free risk control on unseen test data.
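
For context, a distribution-free guarantee of this kind is conventionally stated as follows; this is a generic conformal-risk-control formulation for illustration, not the paper's exact theorem:

```latex
% With a calibrated threshold \hat{t}, the model set \hat{C}_{\hat{t}}(X)
% misses an adequate model M^{*}(X) with probability at most \alpha:
\mathbb{P}\bigl( M^{*}(X) \notin \hat{C}_{\hat{t}}(X) \bigr) \le \alpha
```

Such bounds hold for any data distribution, provided the calibration and test queries are exchangeable, which is what "distribution-free" refers to here.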

Demerits

Complexity

RACER's augmented scoring and finite-sample concentration bounds may add complexity to the routing process.

Expert Commentary

RACER represents a significant advancement in the field of LLM routing, offering a robust and flexible approach to minimizing misrouting risk. The method's ability to construct nested model sets and calibrate thresholds using finite-sample concentration bounds is particularly noteworthy. As the use of LLMs continues to grow, RACER's implications for improving accuracy and controlling risk will be of increasing importance. However, further research is needed to fully explore the potential applications and limitations of this approach.

Recommendations

  • Further experimentation to explore the scalability of RACER in large-scale multi-model systems
  • Investigation into potential applications of RACER's calibrated set-valued selection beyond LLM routing, for example in computer vision model selection
