Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

arXiv:2603.11214v1 Announce Type: new Abstract: We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing from 10M to 100M tokens yields gains of up to 59%, requiring no specific technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2-1.4 of 7 (max 3).

Executive Summary

This article presents a comprehensive evaluation of the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges: a 32-step corporate network attack and a 7-step industrial control system attack. The authors observe two key trends: (1) model performance scales log-linearly with inference-time compute, with no observed plateau, and (2) each successive model generation outperforms its predecessor at fixed token budgets. While the results demonstrate significant progress in AI agents' offensive capabilities, performance on the industrial control system range remains limited. The findings carry important implications for the development and deployment of AI-powered cyber defense systems, and the evaluation approach offers a useful framework for measuring the capabilities and limitations of AI models in complex, multi-step attack scenarios.

Key Points

  • Frontier AI models demonstrate significant progress in autonomous cyber-attack capabilities on purpose-built cyber ranges.
  • Model performance scales log-linearly with inference-time compute, with no observed plateau.
  • Each successive model generation outperforms its predecessor at fixed token budgets.
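The log-linear scaling claim can be sketched numerically. The following is a minimal illustration, not the authors' analysis: it fits steps completed as a linear function of log10(token budget) by ordinary least squares. Only the shape of the relationship comes from the paper; the specific data points below are hypothetical.

```python
import math

# Hypothetical (token budget, average steps completed) observations,
# chosen only to illustrate the log-linear shape reported in the paper.
observations = [(10e6, 9.8), (30e6, 12.1), (100e6, 15.6)]

# Fit steps = a + b * log10(tokens) by ordinary least squares.
xs = [math.log10(tokens) for tokens, _ in observations]
ys = [steps for _, steps in observations]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Log-linear scaling with no plateau means each 10x increase in compute
# adds roughly a constant b steps, so performance keeps extrapolating.
predicted_at_1b = a + b * math.log10(1e9)
print(f"slope per 10x tokens: {b:.2f} steps")
print(f"extrapolated steps at 1B tokens: {predicted_at_1b:.1f}")
```

Under such a fit, "no observed plateau" translates into a constant per-decade gain: moving from 10M to 100M tokens, or from 100M to 1B, is predicted to add the same number of completed steps until the range is exhausted.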

Merits

Strength in methodology

The authors employ a rigorous approach, using multiple models and varying inference-time compute budgets to evaluate the capabilities of frontier AI models.

Use of purpose-built cyber ranges

The use of purpose-built cyber ranges provides a controlled and realistic environment for evaluating the capabilities of AI models in complex cyber attack scenarios.

Quantification of performance gains

The authors provide quantitative measures of performance gains, allowing for a clear understanding of the progress made in AI agents' capabilities.

Demerits

Limited generalizability

The study's findings may not generalize to other types of cyber attacks or scenarios, limiting the applicability of the results.

Dependence on inference-time compute

The observation that model performance scales log-linearly with inference-time compute implies that attack capability is gated largely by computational budget rather than operator skill, a concern the abstract underscores by noting that scaling up compute requires "no specific technical sophistication from the operator."

Expert Commentary

The study's findings are significant, demonstrating the potential of AI models to autonomously carry out complex, multi-step cyber attacks. The observation that each successive model generation outperforms its predecessor at fixed token budgets suggests these capabilities will continue to grow, raising important questions about the associated risks. The study also highlights the need for ongoing research at the intersection of AI and cybersecurity. To maximize the benefits of AI-powered cyber defense systems, it will be essential to address the challenges identified here, including the dependence of attack capability on inference-time compute and the limited generalizability of the results beyond the two evaluated ranges.

Recommendations

  • Further research is needed to address the limitations and challenges identified in the study, including the dependence on inference-time compute and the limited generalizability of the results.
  • Policymakers and regulatory bodies should re-evaluate their approaches to cybersecurity and AI, taking into account the potential benefits and limitations of AI-powered cyber defense systems.
