EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

arXiv:2603.13594v1. Abstract: Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.

Executive Summary

This article reviews EnterpriseOps-Gym, a benchmark for evaluating agentic planning in realistic enterprise settings. The benchmark pairs a containerized sandbox of 164 database tables and 512 functional tools with 1,150 expert-curated tasks across eight verticals. Across 14 frontier models, the top performer, Claude Opus 4.5, succeeds on only 37.4% of tasks. Supplying oracle human plans raises performance by 14-35 percentage points, which the authors take as evidence that strategic reasoning is the primary bottleneck. Agents also frequently fail to refuse infeasible tasks (the best model achieves 53.9%), leading to unintended and potentially harmful side effects. The authors conclude that current agents are not yet ready for autonomous enterprise deployment and offer the benchmark as a concrete testbed for future research.

Key Points

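  • EnterpriseOps-Gym evaluates agentic planning in a containerized sandbox with 164 database tables, 512 functional tools, and 1,150 expert-curated tasks across eight enterprise verticals.
  • State-of-the-art models struggle with complex workflows and strategic reasoning: the top-performing Claude Opus 4.5 achieves only 37.4% success.
  • Oracle human plans improve performance by 14-35 percentage points, but agents often fail to refuse infeasible tasks (the best model achieves 53.9%).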
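
The two headline metrics, task success and refusal of infeasible tasks, can be illustrated with a small scoring sketch. This is not the paper's evaluation harness, which the abstract does not describe; the record fields, function names, and sample data below are all hypothetical, chosen only to show how such rates might be computed from per-task outcomes.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run. All fields are hypothetical, for illustration."""
    feasible: bool            # whether the task can be completed at all
    goal_state_reached: bool  # final sandbox state matches the expected state
    refused: bool             # the agent declined to carry out the task
    state_changed: bool       # the agent modified persistent state


def success_rate(results):
    """Fraction of feasible tasks whose final state matches the expected one."""
    feasible = [r for r in results if r.feasible]
    if not feasible:
        return 0.0
    return sum(r.goal_state_reached for r in feasible) / len(feasible)


def refusal_rate(results):
    """Fraction of infeasible tasks refused without any side effects."""
    infeasible = [r for r in results if not r.feasible]
    if not infeasible:
        return 0.0
    clean = sum(r.refused and not r.state_changed for r in infeasible)
    return clean / len(infeasible)


# Made-up sample outcomes: two feasible tasks, two infeasible ones.
results = [
    TaskResult(feasible=True, goal_state_reached=True, refused=False, state_changed=True),
    TaskResult(feasible=True, goal_state_reached=False, refused=False, state_changed=True),
    TaskResult(feasible=False, goal_state_reached=False, refused=True, state_changed=False),
    TaskResult(feasible=False, goal_state_reached=False, refused=False, state_changed=True),
]

print(f"success rate: {success_rate(results):.1%}")
print(f"refusal rate: {refusal_rate(results):.1%}")
```

Counting a refusal only when no state was changed is a design choice in this sketch: it reflects the abstract's concern that failing to refuse leads to unintended side effects, but the benchmark itself may define refusal differently.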

Merits

Strength in Methodology

The authors take a comprehensive approach to evaluating agentic planning: 1,150 expert-curated tasks across eight verticals, run against 14 frontier models, give the findings breadth across both tasks and systems.

Insightful Analysis

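The study goes beyond aggregate scores to locate where agents fail. The oracle-plan experiment, in which human plans raise performance by 14-35 percentage points, points to strategic reasoning as the primary bottleneck, and the infeasible-task analysis shows that agents often act when they should refuse.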

Demerits

Limited Model Selection

The study evaluates only 14 frontier models, which may not be representative of the broader range of models available.

Need for Further Research

The study calls for more research on robust agentic planning in professional workflows but does not provide a clear roadmap for it.

Expert Commentary

The study critically examines how state-of-the-art models handle complex workflows and strategic reasoning. With the top-performing model succeeding on only 37.4% of tasks, and the best model on infeasible-task refusal reaching 53.9%, the results caution against autonomous enterprise deployment today. The limited set of 14 models should be kept in mind when generalizing the results. EnterpriseOps-Gym provides a concrete testbed for future research, but more work is needed to make agentic planning robust in professional workflows.

Recommendations

  • Future research should focus on developing more robust agentic planning models that can handle complex workflows and strategic reasoning.
  • Policymakers should prioritize the development of AI systems that are designed with human values and safety in mind.

Sources