Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen · March 16, 2026

#cs.CL #cs.CV

arXiv:2603.12823v1. Abstract: Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.
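
The threshold-based policy described in the abstract can be written compactly. The notation below is an assumption made for illustration and may differ from the paper's own: $\mathcal{M}$ is the pool of VLMs, $c_m$ is the per-action cost of model $m$, $\hat{p}_m(a)$ is its predicted accuracy on action $a$, and $\tau$ is the target reliability threshold.

```latex
m^{*}(a) \;=\; \arg\min_{m \in \mathcal{M}} \; c_m
\quad \text{subject to} \quad \hat{p}_m(a) \ge \tau
```

If no model in the pool satisfies the constraint, one natural fallback (again an assumption, not something the abstract states) is to send the action to the strongest available model.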

Executive Summary

This article summarizes Adaptive VLM Routing (AVR), a framework that decides which Vision-Language Model (VLM) should handle each tool call issued by a Computer Use Agent (CUA). AVR inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each action, the layer estimates difficulty from multimodal embeddings, probes a small VLM to measure its confidence, and routes the action to the cheapest model whose predicted accuracy meets a target reliability threshold. Evaluated on ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. For warm agents that retain memory of prior UI interactions, retrieved context narrows the gap between small and large models, so fewer actions need escalation. Combined with the Visual Confused Deputy guardrail, AVR sends high-risk actions directly to the strongest available model, which places efficiency and safety in a single routing framework.

Key Points

▸ The authors propose AVR, a lightweight semantic routing layer placed between the CUA orchestrator and a pool of VLMs.
▸ For each tool call, AVR estimates action difficulty, probes a small VLM for confidence, and routes the action to the cheapest model whose predicted accuracy meets a target reliability threshold.
▸ AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline.
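
The routing rule in the points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` class, the model names, the costs, and the accuracy values are all hypothetical, and the predicted accuracies are taken as given rather than computed from embeddings or a probe. The fallback to the most accurate model when nothing meets the target is also an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Model:
    name: str    # hypothetical model label
    cost: float  # hypothetical relative cost per action


def route(models: Sequence[Model],
          predicted_accuracy: Mapping[str, float],
          target: float) -> Model:
    """Pick the cheapest model whose predicted accuracy meets the target.

    If no model qualifies, fall back to the model with the highest
    predicted accuracy (an assumed fallback, not taken from the paper).
    """
    eligible = [m for m in models if predicted_accuracy[m.name] >= target]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return max(models, key=lambda m: predicted_accuracy[m.name])


# Illustrative pool and predictions; every number here is made up.
POOL = [Model("small-vlm", 1.0), Model("large-vlm", 10.0)]
easy = route(POOL, {"small-vlm": 0.95, "large-vlm": 0.99}, target=0.90)
hard = route(POOL, {"small-vlm": 0.40, "large-vlm": 0.92}, target=0.90)
print(easy.name, hard.name)
```

In this toy setup the easy action stays on the small model and the hard one escalates, which is the behaviour the threshold policy is meant to produce.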

Merits

Improved Efficiency

AVR projects inference cost reductions of up to 78% relative to an all-large-model baseline, which makes it attractive where inference budget is the limiting factor.
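
To see how a routing mix turns into a cost figure, the sketch below computes the fractional saving against an all-large-model baseline. It is a simplified cost model assumed for illustration: the function name, the parameters, and the flat per-action `probe_cost` are hypothetical, and the numbers in the example are not taken from the paper.

```python
def projected_cost_reduction(small_share: float,
                             small_cost: float,
                             large_cost: float,
                             probe_cost: float = 0.0) -> float:
    """Fractional saving versus sending every action to the large model.

    small_share: fraction of actions resolved by the small model.
    probe_cost:  assumed routing overhead paid on every action.
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must lie in [0, 1]")
    if large_cost <= 0.0:
        raise ValueError("large_cost must be positive")
    routed = (probe_cost
              + small_share * small_cost
              + (1.0 - small_share) * large_cost)
    return 1.0 - routed / large_cost


# Illustrative only: 80% of actions stay on a model 10x cheaper.
saving = projected_cost_reduction(0.8, small_cost=1.0, large_cost=10.0)
print(f"{saving:.0%}")
```

With these made-up inputs the saving comes to roughly 72%. Any probe overhead reduces it, and the result is most sensitive to how many actions the small model can absorb.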

Enhanced Safety

The combination of AVR with the Visual Confused Deputy guardrail enhances the safety of CUAs by escalating high-risk actions to the strongest available model.
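
One way to picture the escalation is as a risk check that runs before the cost-based rule. The sketch below assumes the guardrail's verdict arrives as a plain boolean; the function name, the `high_risk` flag, and the model names are hypothetical, and the real interface of the Visual Confused Deputy guardrail is not described in the abstract.

```python
from typing import Mapping


def select_model(predicted_accuracy: Mapping[str, float],
                 cost: Mapping[str, float],
                 target: float,
                 strongest: str,
                 high_risk: bool) -> str:
    """Return the name of the model that should handle the action."""
    # Assumed behaviour: a high-risk verdict skips cost-based routing.
    if high_risk:
        return strongest
    eligible = [name for name, acc in predicted_accuracy.items()
                if acc >= target]
    if not eligible:
        # Assumed fallback when no model meets the target.
        return strongest
    return min(eligible, key=lambda name: cost[name])


# Illustrative values only.
ACC = {"small-vlm": 0.95, "large-vlm": 0.99}
COST = {"small-vlm": 1.0, "large-vlm": 10.0}
routine = select_model(ACC, COST, 0.90, "large-vlm", high_risk=False)
risky = select_model(ACC, COST, 0.90, "large-vlm", high_risk=True)
print(routine, risky)
```

Here the same action goes to the small model when it is routine and to the large model once it is flagged, even though the small model would have met the accuracy target.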

Scalability

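Because routing is decided per tool call against a pool of VLMs, the approach is not tied to any single model. The abstract reports no large-scale deployment results, however, so scalability is a plausible merit rather than a demonstrated one.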

Demerits

Complexity

The routing layer adds a component between the orchestrator and the models. Difficulty estimation, the small-model probe, and the choice of threshold all require additional development and maintenance effort.

Dependence on Model Quality

Routing can only be as good as the VLMs in the pool and the accuracy predictions made about them. If the models are poorly trained or the small VLM's confidence is poorly calibrated, actions may be sent to a model that cannot handle them.

Limited Generalizability

The abstract reports results on ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark only, so the effectiveness of AVR in other applications or domains remains to be evaluated.

Expert Commentary

Adaptive VLM Routing addresses a practical weakness of current Computer Use Agents: every action is sent to one fixed model regardless of difficulty. Routing per action offers a more efficient and potentially safer alternative. Its effectiveness still depends on the quality of the VLMs in the pool, and further evaluation is needed to establish how well it generalizes beyond the reported benchmarks. The extra routing layer also adds development and maintenance burden. Even so, a projected cost reduction of up to 78% while staying within 2 percentage points of an all-large-model baseline is a notable result. Overall, the paper is a useful contribution that shows how model selection can serve both efficiency and safety in CUAs.

Recommendations

✓ Future research should focus on evaluating AVR's generalizability across various applications and domains.
✓ The development of more advanced VLMs and routing frameworks is necessary to further improve the efficiency and safety of CUAs.

Sources

arXiv - cs.CL

Adaptive Vision-Language Model Routing for Computer Use Agents

Related Articles

Autoencoder-Based Parameter Estimation for Superposed Multi-Component Damped Sinusoidal Signals

Multirate Stein Variational Gradient Descent for Efficient Bayesian Sampling

BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
