Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records

arXiv:2603.06836v1

Abstract

Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested.

Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives.

Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records.

Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories.

Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification.
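The test-retest figures in the abstract can be read as per-category agreement between two independent classification runs over the same records. The abstract does not say exactly how agreement was computed, so the following is only a minimal sketch that assumes simple percent agreement on the presence or absence of each category. The function name `retest_agreement`, the record IDs, and the label spellings in `CATEGORIES` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical label spellings for the seven categories named in the abstract.
CATEGORIES = [
    "alcohol",
    "cannabis",
    "opioid",
    "stimulant",
    "sedative_hypnotic_anxiolytic",
    "hallucinogen",
    "inhalant",
]


def retest_agreement(run_a, run_b, categories=CATEGORIES):
    """Percent agreement per category between two classification runs.

    run_a, run_b: dicts mapping a record ID to the set of category labels
    assigned in that run (multi-label, so a set may hold zero or many labels).
    Only records present in both runs are compared.
    """
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs have no records in common")

    agreement = {}
    for category in categories:
        # A record agrees when both runs flag the category or both omit it.
        matches = sum(
            (category in run_a[rec]) == (category in run_b[rec]) for rec in shared
        )
        agreement[category] = 100.0 * matches / len(shared)
    return agreement


# Toy data, invented for illustration only.
first_run = {"r1": {"alcohol"}, "r2": {"opioid", "alcohol"}, "r3": set(), "r4": {"cannabis"}}
second_run = {"r1": {"alcohol"}, "r2": {"opioid"}, "r3": set(), "r4": {"cannabis"}}
print(retest_agreement(first_run, second_run))
```

Each category is scored separately because a record can carry several substance labels at once, so one run can agree with the other on opioid while disagreeing on alcohol for the same record.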

Executive Summary

This study validates a small, locally hosted large language model (LLM) for classifying specific substance types, aligned with DSM-5 categories, from child welfare investigation narratives. For five of the seven categories (alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic), the model reached almost perfect agreement with expert review (kappa = 0.94-1.00) and classification precision of 92% to 100%. The two low-prevalence categories, hallucinogen and inhalant, performed poorly. The findings suggest that a small, locally hosted LLM can reliably identify substance types in child welfare administrative text, extending prior work on binary classification to multi-label substance identification.

Key Points

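  • The study validates a locally hosted 20-billion-parameter LLM for classifying substance types in child welfare investigation narratives.
  • Five categories (alcohol, cannabis, opioid, stimulant, sedative/hypnotic/anxiolytic) reached kappa = 0.94-1.00 and precision of 92% to 100%.
  • The two low-prevalence categories (hallucinogen, inhalant) performed poorly, highlighting the need for further refinement.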

Merits

Strength

The study demonstrates that a small LLM can classify substance types in child welfare narratives without relying on an externally hosted model, which may make large-scale processing of such records more practical.

Methodological Rigor

The study validates the LLM's output in two ways: expert human review of 900 stratified cases, which yields precision, recall, and Cohen's kappa, and a test-retest stability check on approximately 15,000 independently classified records.
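For readers less familiar with these metrics, the sketch below shows how precision, recall, and Cohen's kappa are conventionally computed for one substance category from paired model and expert labels. It uses the standard textbook definitions; it is not the authors' code, the function names are hypothetical, and the toy labels are invented for illustration.

```python
import math


def confusion_counts(model_labels, expert_labels):
    """Count true/false positives and negatives for one binary category."""
    tp = fp = fn = tn = 0
    for model, expert in zip(model_labels, expert_labels):
        if model and expert:
            tp += 1
        elif model and not expert:
            fp += 1
        elif not model and expert:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_kappa(model_labels, expert_labels):
    """Return (precision, recall, Cohen's kappa), treating expert labels as reference."""
    tp, fp, fn, tn = confusion_counts(model_labels, expert_labels)
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("no labels to compare")

    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan

    # Observed agreement: share of cases where both raters give the same label.
    p_observed = (tp + tn) / n
    # Chance agreement: expected overlap given each rater's marginal rates.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else math.nan
    return precision, recall, kappa


# Toy labels (1 = category present), invented for illustration only.
model = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
expert = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(precision_recall_kappa(model, expert))
```

Kappa discounts the agreement expected by chance, so it can be much lower than raw agreement: in the toy data the two label sets match on 8 of 10 cases, yet kappa is only about 0.58.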

Demerits

Limitation

The narratives come from a single Midwestern U.S. state, so the findings may not generalize to other jurisdictions, populations, or datasets. Further research is needed to validate the LLM's performance across diverse contexts.

Underlying Assumptions

The study assumes that the LLM's performance is directly applicable to real-world child welfare settings, which may not be the case due to differences in data quality, context, and complexity.

Expert Commentary

The findings are significant because they show that a small, locally hosted LLM can move beyond binary detection to classify specific substance types in child welfare narratives. The methodology is sound and the results are promising. The limitations should still be acknowledged, including the potential for bias in the training data, the poor performance on the two low-prevalence categories, and the need to validate the LLM's performance across diverse contexts. The findings also underline the importance of efficient and accurate data analysis in child welfare settings, with implications for policy development and program evaluation.

Recommendations

  • Future research should focus on validating the LLM's performance across diverse contexts, including different populations, datasets, and settings.
  • The study's findings should be replicated and expanded to other areas of child welfare, including mental health, education, and social work.
