Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

arXiv:2603.19715v1. Abstract: Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.

Executive Summary

This article presents Stepwise, a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. By combining fine-tuned large language models (LLMs) with interactive theorem proving (ITP) tools, the framework enables data-efficient LLM adaptation and semantics-informed pruning of the search space. The authors evaluate Stepwise on the FVEL seL4 benchmark and additional Isabelle developments: on seL4 it proves up to 77.6% of theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer. The results indicate a viable path toward scalable automated software verification, though generalization beyond the evaluated Isabelle developments, to other proof assistants and verification domains, remains to be explored.

Key Points

  • Stepwise combines LLMs with ITP tools for automated proof search
  • The framework enables data-efficient LLM adaptation and semantics-informed pruning
  • Stepwise proves up to 77.6% of seL4 theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer

Merits

Strength in LLM adaptation

The framework's ability to adapt LLMs efficiently enables robust performance in automated proof search.

Generalization across benchmarks

Stepwise generalizes beyond its primary benchmark: in addition to the FVEL seL4 results, it performs strongly on further Isabelle developments, indicating a viable path toward scalable automated software verification.

Improved proof search efficiency

The framework's synergy between LLMs and ITP tools enables semantics-informed pruning of the search space, leading to improved proof search efficiency.
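To make the search loop concrete, here is a minimal sketch of a best-first proof search of the kind the paper describes. The interfaces are hypothetical stand-ins: `propose_steps` plays the role of the fine-tuned LLM, `apply_step` plays the role of the Isabelle REPL that accepts or rejects a step, and proof states are toy integers (goal closed at 0). The paper's actual components (step repair, Sledgehammer-style subgoal discharge) are omitted.

```python
import heapq

def propose_steps(state, k=2):
    """LLM stand-in: return up to k candidate proof steps for a state."""
    return [1, 2][:k]

def apply_step(state, step):
    """ITP stand-in: return the successor proof state, or None if rejected."""
    new = state - step
    return new if new >= 0 else None

def score(state):
    """Heuristic used to rank open states (lower = more promising)."""
    return state

def best_first_proof_search(initial_state, budget=100):
    """Best-first search over proof states: pop the most promising state,
    query the model for candidate steps, keep only checker-validated
    successors, and stop when a state with no remaining subgoals is reached."""
    frontier = [(score(initial_state), initial_state, [])]
    seen = {initial_state}
    for _ in range(budget):
        if not frontier:
            break
        _, state, proof = heapq.heappop(frontier)
        if state == 0:                     # proof complete
            return proof
        for step in propose_steps(state):
            nxt = apply_step(state, step)  # symbolic side rejects invalid steps
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), nxt, proof + [step]))
    return None                            # search budget exhausted
```

The semantics-informed pruning in the paper corresponds to the checker discarding invalid successors and the scoring function ordering the frontier; in Stepwise both draw on ITP-level information rather than the toy heuristic used here.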

Demerits

Scalability and generalizability limitations

While Stepwise achieves significant improvements on the evaluated Isabelle benchmarks, its scalability to larger verification projects and its generalizability to other proof assistants and verification domains remain to be explored.

Dependence on specific LLMs and ITP tools

The framework's performance may be sensitive to the choice of LLMs and ITP tools. In particular, the implementation is built on a new Isabelle REPL, so porting it to other proof assistants would require comparable tooling, which could limit its adaptability and robustness.

Expert Commentary

The article presents a promising approach to automating proof search for systems-level verification. Its key insight is the tight coupling of neural and symbolic components: the LLM proposes candidate proof steps, while ITP tools validate, repair, and rank them, pruning the search space using proof semantics rather than surface heuristics. The strong seL4 results suggest this synergy scales to real systems-level proofs, with potential downstream benefits for software quality, safety, and security. That said, the evaluation is confined to Isabelle developments, so further research is needed to assess the approach on other proof assistants and verification domains and to fully realize its potential.

Recommendations

  • The authors should explore the framework's scalability and generalizability to diverse verification tasks and domains.
  • Further research is needed to investigate the sensitivity of Stepwise's performance to the choice of LLMs and ITP tools.

Sources

Original: arXiv - cs.AI