AgriPestDatabase-v1.0: A Structured Insect Dataset for Training Agricultural Large Language Model
arXiv:2603.22777v1 Abstract: Agricultural pest management increasingly relies on timely and accurate access to expert knowledge, yet high-quality labeled data and continuous expert support remain limited, particularly for farmers operating in rural regions with unstable or no internet connectivity. At the same time, the rapid growth of AI and LLMs has created new opportunities to deliver practical decision support tools directly to end users in agriculture through compact and deployable systems. This work addresses (i) generating a structured insect information dataset, and (ii) adapting a lightweight LLM (≤ 7B) by fine-tuning it for edge-device use in agricultural pest management. The textual data were collected by reviewing available pest databases and published manuscripts on nine selected pest species. These structured reports were then reviewed and validated by a domain expert. From these reports, we constructed Q/A pairs to support model training and evaluation. A LoRA-based fine-tuning approach was applied to multiple lightweight LLMs and evaluated. Initial evaluation shows that Mistral 7B achieves an 88.9% pass rate on the domain-specific Q/A task, substantially outperforming Qwen 2.5 7B (63.9%) and LLaMA 3.1 8B (58.7%). Notably, Mistral demonstrates higher semantic alignment (embedding similarity: 0.865) despite lower lexical overlap (BLEU: 0.097), indicating that semantic understanding and robust reasoning are more predictive of task success than surface-level conformity in specialized domains. By combining expert-organized data, well-structured Q/A pairs, semantic quality control, and efficient model adaptation, this work contributes towards farmer-facing agricultural decision support tools and demonstrates the feasibility of deploying compact, high-performing language models for practical field-level pest management guidance.
Executive Summary
The article introduces AgriPestDatabase-v1.0, a structured insect dataset built for training lightweight LLMs to support agricultural pest management. To address limited access to expert knowledge in rural areas, the authors combine expert-curated data with lightweight LLMs (≤7B parameters) fine-tuned via LoRA. The dataset is built from structured reports on nine pest species, validated by a domain expert and converted into Q/A pairs for training and evaluation. In the initial evaluation, Mistral 7B reaches an 88.9% pass rate, ahead of Qwen 2.5 7B (63.9%) and LLaMA 3.1 8B (58.7%), with higher semantic alignment (embedding similarity 0.865) despite lower lexical overlap (BLEU 0.097). The work indicates that compact, domain-adapted LLMs are viable for edge-device deployment in agriculture when paired with expert validation and semantic quality control.
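The step from validated reports to Q/A pairs can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PestReport` schema, the field names, and the question templates are hypothetical and are not taken from the paper, which may construct its pairs quite differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PestReport:
    # Hypothetical schema: the paper's actual report fields are not given here.
    species: str
    fields: Dict[str, str]


# Hypothetical question templates, one per report field.
TEMPLATES = {
    "host_crops": "Which crops does {species} attack?",
    "damage": "What damage does {species} cause?",
    "management": "How can {species} be managed?",
}


def build_qa_pairs(report: PestReport) -> List[Dict[str, str]]:
    """Turn each known, non-empty report field into one Q/A pair."""
    pairs = []
    for field_name, answer in report.fields.items():
        template = TEMPLATES.get(field_name)
        # Skip fields without a template and fields with no content.
        if template is None or not answer.strip():
            continue
        pairs.append({
            "question": template.format(species=report.species),
            "answer": answer.strip(),
            "source_field": field_name,
        })
    return pairs


# Placeholder content only; no agronomic facts are asserted here.
example = PestReport(
    species="example pest",
    fields={"host_crops": "placeholder crop list", "damage": "placeholder damage notes"},
)
example_pairs = build_qa_pairs(example)
```

Keeping the `source_field` on each pair is one way to let a reviewer trace an answer back to the part of the report it came from.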
Key Points
- ▸ Structured insect dataset for pest management
- ▸ Fine-tuning lightweight LLMs (≤7B) via LoRA
- ▸ Semantic alignment is more predictive of task success than lexical overlap in domain-specific tasks
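As background on why LoRA suits small, deployable models, the sketch below shows the general low-rank update, in which a frozen weight matrix W is augmented by a scaled product of two small trainable matrices B and A. The layer size, rank, and scaling value are assumptions chosen for illustration; the summary does not state the paper's LoRA configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W is the frozen d_out x d_in weight, A is r x d_in, B is d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not taken from the paper).
d_model = 4096
rank = 16
full = d_model * d_model
low_rank = lora_trainable_params(d_model, d_model, rank)
fraction = low_rank / full  # share of one layer's weights that LoRA trains
```

Under these assumed sizes the adapter trains fewer than 1% of the parameters of the layer it modifies, which is the property that makes this kind of fine-tuning practical on modest hardware.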
Merits
Expert-Curated Data Quality
Domain-validated Q/A pairs ensure accuracy and relevance for agricultural contexts.
Performance Advantage
Mistral 7B’s 88.9% pass rate, against 63.9% for Qwen 2.5 7B and 58.7% for LLaMA 3.1 8B, signals strong domain adaptation potential for edge use.
Demerits
Data Scope Limitation
Dataset covers only nine pest species; scalability to broader insect taxonomy may require significant expansion effort.
Expert Commentary
This paper strikes a critical balance between data rigor and technological pragmatism. The integration of expert validation into Q/A curation is a hallmark of methodological integrity, particularly in high-stakes domains like pest management. Reporting embedding similarity (0.865) alongside BLEU (0.097) is a sensible choice: it moves beyond surface overlap to capture whether an answer means the same thing as the reference, which matters more for agricultural decision-making than matching its wording. Moreover, the finding that Mistral 7B outperforms the slightly larger LLaMA 3.1 8B highlights a pragmatic point: performance is not dictated by model size alone, but also by how well a model adapts to the domain. This work may catalyze a broader trend toward 'precision AI' in agriculture, where compact, semantically aligned models replace resource-intensive alternatives without compromising efficacy. It also raises important questions about deploying AI in rural settings: if models are validated by experts but not continuously updated, how do we mitigate drift in pest identification? These considerations must inform future iterations.
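The contrast between lexical overlap and semantic alignment can be made concrete with a small sketch. It is not the paper's evaluation code: `ngram_precision` is a simplified clipped n-gram precision (one ingredient of BLEU, not full BLEU), and `cosine_similarity` expects embedding vectors that would, in practice, come from a sentence-embedding model of the evaluator's choosing. The two example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up paraphrase pair: similar meaning, mostly different wording.
reference = "apply neem oil to infested leaves"
candidate = "spray the affected foliage with neem oil"
bigram_score = ngram_precision(candidate, reference, n=2)
```

A paraphrase like this one shares few bigrams with its reference even though a reader would judge the two answers equivalent, which is the situation in which an embedding-based score and an overlap-based score can disagree.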
Recommendations
- ✓ 1. Expand the AgriPestDatabase to include at least 20 pest species and integrate multilingual support for low-resource language communities.
- ✓ 2. Develop a modular fine-tuning framework that allows domain experts to iteratively update Q/A pairs via mobile interfaces, enabling real-time adaptation without requiring re-training at scale.
Sources
Original: arXiv - cs.AI