
CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents


Marta Sumyk, Oleksandr Kosovan

arXiv:2603.10577v1

Abstract: Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments from high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.

Executive Summary

The paper CUAAudit evaluates Vision-Language Models (VLMs) as auditors of autonomous Computer-Use Agents (CUAs), which execute desktop tasks from natural-language instructions. Given the scalability challenges of evaluating CUAs, the authors assess five VLMs on three benchmarks spanning macOS, Windows, and Linux in a meta-evaluation focused on accuracy, confidence calibration, and inter-model agreement. While the VLMs demonstrate strong accuracy and calibration, the study reveals consistent performance degradation in complex or heterogeneous environments and significant inter-model disagreement. These findings expose critical limitations in current model-based auditing methodologies and underscore the need to incorporate evaluator reliability, uncertainty, and variance into deployment strategies for real-world CUAs.

Key Points

  • VLMs show strong accuracy and calibration as auditors of CUAs
  • Performance degradation occurs in complex or heterogeneous environments
  • Inter-model disagreement is significant even among high-performing models
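
The paper does not publish its evaluation code, but the three audit dimensions above can be made concrete with a minimal sketch for binary success/failure verdicts. The data below is entirely hypothetical, and the metric choices (expected calibration error with equal-width confidence bins, raw pairwise agreement) are common conventions, not necessarily the authors' exact definitions:

```python
# Illustrative sketch (not from the paper): quantifying accuracy,
# calibration, and inter-model agreement for binary auditor verdicts.
# All verdicts, labels, and confidences below are hypothetical.
from itertools import combinations

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def expected_calibration_error(confs, preds, labels, n_bins=5):
    """Bin verdicts by confidence; ECE is the size-weighted gap
    between mean confidence and accuracy within each bin."""
    bins = [[] for _ in range(n_bins)]
    for c, p, y in zip(confs, preds, labels):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, p == y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / len(labels) * abs(avg_conf - avg_acc)
    return ece

def pairwise_agreement(verdicts):
    """Mean fraction of tasks on which each pair of auditors agrees."""
    rates = [sum(a == b for a, b in zip(u, v)) / len(u)
             for u, v in combinations(verdicts, 2)]
    return sum(rates) / len(rates)

# Hypothetical verdicts from three auditors on six tasks (1 = success)
labels = [1, 0, 1, 1, 0, 1]
auditor_preds = [[1, 0, 1, 0, 0, 1],
                 [1, 0, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0, 1]]
confs = [0.9, 0.8, 0.95, 0.6, 0.7, 0.85]  # first auditor's confidences

print(round(accuracy(auditor_preds[0], labels), 3))                        # 0.833
print(round(expected_calibration_error(confs, auditor_preds[0], labels), 3))  # 0.133
print(round(pairwise_agreement(auditor_preds), 3))                         # 0.667
```

The toy numbers already illustrate the paper's headline finding: an auditor can be individually accurate and reasonably calibrated while the panel as a whole agrees on only two thirds of the tasks.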

Merits

Innovative Audit Framework

The study introduces a novel meta-evaluation approach using VLMs as auditors, offering a scalable alternative to traditional benchmarks or manual inspections.

Demerits

Limitation in Heterogeneous Environments

Current VLMs exhibit performance degradation when applied to complex or heterogeneous desktop environments, indicating a scalability bottleneck.

Expert Commentary

This work represents a meaningful step toward bridging the gap between autonomous agent evaluation and real-world deployment. The authors rightly identify that while VLMs offer a promising alternative to conventional evaluation pipelines, their limitations—particularly in heterogeneous environments—cannot be overlooked. The inter-model disagreement phenomenon is particularly noteworthy; it suggests that current VLMs lack sufficient contextual robustness or interpretability to uniformly assess task completion across diverse user environments. This is a critical insight for both academic researchers and industry stakeholders. Moving forward, integrating uncertainty quantification and robustness testing into VLM-based auditing models will be essential. Additionally, hybrid auditing frameworks combining model-based assessments with human-in-the-loop validation may offer a more reliable path toward scalable, trustworthy autonomous agent deployment. The article rightly shifts the conversation from evaluation as a binary success/failure metric to a nuanced, probabilistic assessment grounded in observable evidence.

Recommendations

  • Incorporate uncertainty quantification mechanisms into VLM-based auditing frameworks.
  • Develop hybrid auditing models that combine automated VLM evaluations with targeted human verification for high-stakes or ambiguous cases.
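
The second recommendation can be sketched as a simple triage rule: accept a verdict automatically only when the auditor panel is unanimous and confident, and otherwise escalate to a human reviewer. The thresholds and the aggregation rule below are illustrative assumptions, not a design proposed in the paper:

```python
# Hypothetical hybrid-auditing sketch: escalate to human review when
# VLM auditors disagree or report low confidence. Thresholds are
# illustrative defaults, not values from the paper.
from dataclasses import dataclass

@dataclass
class Verdict:
    success: bool
    confidence: float  # auditor's self-reported confidence in [0, 1]

def triage(verdicts, min_confidence=0.8, min_agreement=1.0):
    """Return ('auto', decision) when auditors are sufficiently
    unanimous and confident; otherwise ('human', None) to request
    targeted human verification."""
    votes = [v.success for v in verdicts]
    agreement = max(votes.count(True), votes.count(False)) / len(votes)
    mean_conf = sum(v.confidence for v in verdicts) / len(verdicts)
    if agreement >= min_agreement and mean_conf >= min_confidence:
        return "auto", votes[0]
    return "human", None

# Unanimous and confident: accepted automatically
print(triage([Verdict(True, 0.9), Verdict(True, 0.85), Verdict(True, 0.95)]))
# Split panel: escalated to a human reviewer
print(triage([Verdict(True, 0.9), Verdict(False, 0.9), Verdict(True, 0.9)]))
```

Relaxing `min_agreement` below 1.0 would trade human workload for exposure to exactly the inter-model disagreement the paper documents, so the threshold should be tuned against measured auditor reliability.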
