Adaptive Vision-Language Model Routing for Computer Use Agents
arXiv:2603.12823v1. Abstract: Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.
Executive Summary
This article proposes Adaptive VLM Routing (AVR), a framework that decides, for each tool call in a Computer Use Agent (CUA), which Vision-Language Model (VLM) should handle it. AVR inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. The layer estimates action difficulty from multimodal embeddings, probes a small VLM for confidence, and routes the action to the cheapest model whose predicted accuracy meets a target reliability threshold. Using ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark, the authors project inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. Combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model. The framework could improve the efficiency and safety of CUAs in applications such as screen scraping and automated testing.
Key Points
- ▸ AVR adds a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs.
- ▸ For each tool call, the layer estimates action difficulty, probes a small VLM for confidence, and routes the action to the cheapest model whose predicted accuracy meets a target reliability threshold.
- ▸ AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline.
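The threshold rule in the points above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the `Model` class, the `route` function, the model names, the costs, and the accuracy estimates are all hypothetical. How AVR turns embeddings and probe confidence into a predicted accuracy is not specified here, so the sketch takes that estimate as an input. The fallback to the most accurate model when nothing meets the threshold is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical model identifier
    cost_per_call: float  # assumed relative cost of one call


def route(
    models: Sequence[Model],
    predicted_accuracy: Callable[[Model], float],
    threshold: float,
) -> Model:
    """Return the cheapest model whose predicted accuracy meets the threshold.

    If no model qualifies, fall back to the model with the highest
    predicted accuracy (an assumption made for this sketch).
    """
    if not models:
        raise ValueError("model pool is empty")
    # Walk the pool from cheapest to most expensive and stop at the first fit.
    for model in sorted(models, key=lambda m: m.cost_per_call):
        if predicted_accuracy(model) >= threshold:
            return model
    return max(models, key=predicted_accuracy)


# Hypothetical pool and estimates for a single tool call.
pool = [Model("small-vlm", 1.0), Model("large-vlm", 10.0)]
estimates = {"small-vlm": 0.93, "large-vlm": 0.98}

easy = route(pool, lambda m: estimates[m.name], threshold=0.90)
hard = route(pool, lambda m: estimates[m.name], threshold=0.95)
print(easy.name, hard.name)
```

With these made-up numbers, the looser threshold keeps the action on the small model and the stricter one escalates it, which is the behavior the threshold policy is meant to produce.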
Merits
Improved Efficiency
AVR projects inference cost reductions of up to 78% relative to an all-large-model baseline, which makes it attractive where compute budgets are limited.
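To see how a saving of this kind depends on the escalation rate, consider a back-of-the-envelope calculation. It is a sketch under stated assumptions, not the paper's cost model: every action pays for one small-model probe, escalated actions additionally pay for one large-model call, and the baseline sends every action to the large model. The function name and all numbers are hypothetical and are not results from the paper.

```python
def projected_cost_reduction(
    small_cost: float,
    large_cost: float,
    escalation_rate: float,
) -> float:
    """Fractional cost saving of probe-then-escalate routing vs. all-large.

    Assumes each action costs one small-model probe, plus one large-model
    call for the fraction of actions that are escalated.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    if small_cost < 0.0 or large_cost <= 0.0:
        raise ValueError("costs must be non-negative, large_cost positive")
    routed_cost = small_cost + escalation_rate * large_cost
    baseline_cost = large_cost
    return 1.0 - routed_cost / baseline_cost


# Hypothetical inputs: the large model costs 10x the small one and
# 20% of actions are escalated.
saving = projected_cost_reduction(small_cost=1.0, large_cost=10.0, escalation_rate=0.2)
print(f"{saving:.0%}")
```

Under these assumptions the saving shrinks as more actions are escalated, and it turns negative if nearly every action is escalated, because the probe is then pure overhead.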
Enhanced Safety
The combination of AVR with the Visual Confused Deputy guardrail enhances the safety of CUAs by escalating high-risk actions to the strongest available model.
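One way to picture the escalation is as an override placed in front of the cost-based choice. The sketch below is hypothetical: the `high_risk` flag stands in for whatever signal the Visual Confused Deputy guardrail produces, "strongest" is taken to mean the model with the highest predicted accuracy, and the function name and model names are invented for illustration.

```python
from typing import Mapping


def select_with_guardrail(
    predicted_accuracy: Mapping[str, float],
    cost: Mapping[str, float],
    threshold: float,
    high_risk: bool,
) -> str:
    """Pick a model name, letting a risk flag override cost-based routing."""
    if not predicted_accuracy:
        raise ValueError("model pool is empty")
    # Assumption: the strongest model is the one with the highest estimate.
    strongest = max(predicted_accuracy, key=lambda name: predicted_accuracy[name])
    if high_risk:
        # Safety override: skip cost considerations entirely.
        return strongest
    eligible = [name for name, acc in predicted_accuracy.items() if acc >= threshold]
    if not eligible:
        return strongest
    return min(eligible, key=lambda name: cost[name])


# Hypothetical estimates and costs for one action.
accuracy = {"small-vlm": 0.96, "large-vlm": 0.99}
prices = {"small-vlm": 1.0, "large-vlm": 10.0}

print(select_with_guardrail(accuracy, prices, threshold=0.95, high_risk=False))
print(select_with_guardrail(accuracy, prices, threshold=0.95, high_risk=True))
```

Checking the risk flag before any cost comparison means a flagged action cannot be sent to a cheaper model, even when that model clears the accuracy threshold.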
Scalability
AVR routes each tool call independently across a pool of VLMs, a design that should scale to large deployments; the abstract itself reports no large-scale deployment results.
Demerits
Complexity
The addition of a lightweight semantic routing layer may introduce complexity to the CUA system, requiring additional development and maintenance efforts.
Dependence on Model Quality
AVR's performance relies heavily on the quality of the VLMs used, which may be a limitation if the models are not well-trained or validated.
Limited Generalizability
The evaluation uses ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark, so AVR's effectiveness in other applications or domains remains to be shown and may require further evaluation and adaptation.
Expert Commentary
The proposed Adaptive VLM Routing framework could make Computer Use Agents more efficient and safer by matching each GUI action to the cheapest model that can handle it reliably, rather than sending every action to one fixed model. Its effectiveness still depends on the quality of the VLMs in the pool, and further evaluation is needed to establish how well it generalizes across applications. The extra routing layer also adds development and maintenance work. Even so, the projected inference cost reduction of up to 78%, while staying within 2 percentage points of an all-large-model baseline, is a strong result. Overall, the article makes a valuable contribution by showing that per-action model selection can address efficiency and safety in CUAs within a single routing framework.
Recommendations
- ✓ Future research should focus on evaluating AVR's generalizability across various applications and domains.
- ✓ The development of more advanced VLMs and routing frameworks is necessary to further improve the efficiency and safety of CUAs.