JFTA-Bench: Evaluate LLM's Ability of Tracking and Analyzing Malfunctions Using Fault Trees
arXiv:2603.22978v1. Abstract: In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be directly processed by large language models, which can assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments and evaluates a model's ability to assist in malfunction localization. The benchmark contains 3130 entries, with 40.75 turns per entry on average. We train an end-to-end model to generate vague information to reflect user behavior and introduce long-range rollback and recovery procedures to simulate user error scenarios, enabling assessment of a model's integrated capabilities in task tracking and error recovery. Gemini 2.5 Pro achieves the best performance.
Executive Summary
The article proposes a novel textual representation of fault trees so that large language models can process fault trees stored as images and help track and analyze malfunctions in complex systems. Building on this representation, the authors construct JFTA-Bench, a multi-turn dialogue benchmark that evaluates how well a model assists in malfunction localization. The benchmark contains 3130 entries with an average of 40.75 turns per entry. To reflect user behavior, the authors train an end-to-end model that generates vague information, and they add long-range rollback and recovery procedures to simulate user errors. Gemini 2.5 Pro achieves the best performance.
Key Points
- ▸ Novel textual representation of fault trees for large language models
- ▸ Construction of JFTA-Bench benchmark for evaluating malfunction localization
- ▸ Introduction of long-range rollback and recovery procedures to simulate user error scenarios
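This summary does not reproduce the paper's actual textual format, so the following is only a minimal Python sketch of two of the ideas listed above: serializing a fault tree as text, and rolling back several steps after a user error. Every name here (`Node`, `to_text`, `Tracker`, the pump example and the indented line format) is a hypothetical illustration, not the authors' representation or method.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One fault-tree node: an event plus the gate joining its children."""
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF" for a basic event
    children: list["Node"] = field(default_factory=list)


def to_text(node: Node, depth: int = 0) -> str:
    """Serialize the tree as indented text, one node per line (assumed format)."""
    line = "  " * depth + f"[{node.gate}] {node.name}"
    return "\n".join([line] + [to_text(c, depth + 1) for c in node.children])


class Tracker:
    """Tracks the position reached in the tree during a dialogue."""

    def __init__(self, root: Node) -> None:
        self.path = [root]  # nodes visited from the root to the current node

    @property
    def current(self) -> Node:
        return self.path[-1]

    def descend(self, child_name: str) -> Node:
        """Move to the named child of the current node."""
        for child in self.current.children:
            if child.name == child_name:
                self.path.append(child)
                return child
        raise KeyError(child_name)

    def rollback(self, steps: int) -> Node:
        """Undo up to `steps` moves, never going above the root."""
        steps = min(steps, len(self.path) - 1)
        if steps > 0:
            del self.path[-steps:]
        return self.current


# Hypothetical example tree (not taken from the paper).
example = Node("Pump does not start", "OR", [
    Node("No power", "OR", [Node("Breaker tripped"), Node("Cable damaged")]),
    Node("Motor seized"),
])

print(to_text(example))

tracker = Tracker(example)
tracker.descend("No power")
tracker.descend("Breaker tripped")
print(tracker.rollback(2).name)  # back at the root node
```

Keeping the visited path as a list makes a multi-step rollback a single slice deletion. A real system would also need to restore any dialogue state gathered along the discarded steps, which this sketch leaves out.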
Merits
Comprehensive Benchmark
The JFTA-Bench benchmark evaluates a model's ability to track and analyze malfunctions at scale: it contains 3130 entries averaging 40.75 dialogue turns each, and it includes vague user input and long-range rollback scenarios.
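For a sense of scale, the two figures quoted in the abstract imply roughly 127,500 dialogue turns in total. The short Python sketch below does that arithmetic and also brackets the total, assuming the reported mean was rounded to two decimals. That rounding assumption and the function names are ours, not the paper's.

```python
# Figures quoted in the abstract.
ENTRIES = 3130
AVG_TURNS = 40.75  # reported mean; assumed here to be rounded to two decimals


def estimated_total_turns(entries: int, avg_turns: float) -> float:
    """Point estimate of the total number of turns: entries times mean turns."""
    return entries * avg_turns


def total_turn_bounds(entries: int, avg_turns: float, decimals: int = 2):
    """Range of totals consistent with a mean rounded to `decimals` places."""
    half_step = 0.5 * 10 ** (-decimals)
    return entries * (avg_turns - half_step), entries * (avg_turns + half_step)


print(estimated_total_turns(ENTRIES, AVG_TURNS))
print(total_turn_bounds(ENTRIES, AVG_TURNS))
```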
Demerits
Limited Generalizability
Results obtained on JFTA-Bench may not carry over to other complex systems or to other fault tree representations, which limits how widely the proposed approach applies.
Expert Commentary
The proposed approach could make malfunction localization in complex systems faster and more reliable. Further research is still needed on the limitations of large language models in this setting, including explainability, transparency, and generalizability. The JFTA-Bench benchmark is a notable contribution because it gives future research and development a common evaluation framework.
Recommendations
- ✓ Further research on the generalizability of the proposed approach to other complex systems and fault tree representations
- ✓ Investigation into the explainability and transparency of large language models in malfunction localization
Sources
Original: arXiv - cs.AI