Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms

arXiv:2603.22332v1 Abstract: Data imputation is a cornerstone technique for handling missing values, which plague many real-world datasets. Despite recent progress, prior studies on Large Language Model-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models' prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.
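
The core method is zero-shot prompting: each incomplete row is serialized into natural language and the model is asked to fill the missing cell. The sketch below is a minimal illustration of that idea, not the paper's actual prompts; `call_llm` is a hypothetical stand-in for whatever chat-completion interface is used.

```python
# Minimal sketch of zero-shot prompt-based imputation for one tabular row.
# The prompt template is an assumption; the paper's prompts are not given.
from typing import Callable

def build_imputation_prompt(row: dict, target_column: str) -> str:
    """Serialize the observed cells of a row and ask for the missing one."""
    observed = ", ".join(
        f"{col} = {val}" for col, val in row.items()
        if col != target_column and val is not None
    )
    return (
        "You are given one record from a tabular dataset.\n"
        f"Known values: {observed}.\n"
        f"Predict the most plausible value of '{target_column}'. "
        "Answer with the value only, no explanation."
    )

def impute_cell(row: dict, target_column: str,
                call_llm: Callable[[str], str]) -> str:
    # `call_llm` is a hypothetical stand-in: any function that sends a
    # prompt to a chat model and returns its text completion.
    return call_llm(build_imputation_prompt(row, target_column)).strip()

# Toy usage: the semantic context ("Portugal", "red") is exactly what the
# paper credits for LLM gains on real-world data.
row = {"country": "Portugal", "variety": "red", "price": None}
# predicted_price = impute_cell(row, "price", call_llm=my_chat_fn)
```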

Executive Summary

This article summarizes a comprehensive benchmarking study of Large Language Models (LLMs) for missing data imputation in tabular datasets. The study compares five widely used LLMs against six state-of-the-art imputation baselines across 29 datasets (nine of them synthetic) under the MCAR, MAR, and MNAR missingness mechanisms. Leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, achieve superior performance on real-world open-source datasets, whereas traditional methods such as MICE win on synthetic data; together these results suggest that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. LLMs also incur significantly higher computational time and monetary costs. By weighing imputation quality against these costs, the study positions LLMs as promising semantics-driven imputers for complex tabular data.
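
For readers unfamiliar with the three mechanisms, the sketch below shows the standard textbook way to simulate them by masking ground-truth cells; the paper's exact injection procedure is not described in the abstract, so every detail here is an assumption.

```python
# Standard simulation of the three missingness mechanisms named in the study.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def inject_mcar(df: pd.DataFrame, col: str, rate: float = 0.2) -> pd.DataFrame:
    """MCAR: every cell is masked with the same probability."""
    out = df.copy()
    mask = rng.random(len(out)) < rate
    out.loc[mask, col] = np.nan
    return out

def inject_mar(df: pd.DataFrame, col: str, driver: str, rate: float = 0.2) -> pd.DataFrame:
    """MAR: masking probability depends on another *observed* column."""
    out = df.copy()
    ranks = out[driver].rank(pct=True)       # higher driver -> more missing
    mask = rng.random(len(out)) < 2 * rate * ranks  # mean prob ~= rate
    out.loc[mask, col] = np.nan
    return out

def inject_mnar(df: pd.DataFrame, col: str, rate: float = 0.2) -> pd.DataFrame:
    """MNAR: masking probability depends on the (unobserved) value itself."""
    out = df.copy()
    ranks = out[col].rank(pct=True)
    mask = rng.random(len(out)) < 2 * rate * ranks
    out.loc[mask, col] = np.nan
    return out
```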

Key Points

  • Leading LLMs achieve superior imputation performance on real-world open-source datasets (see the scoring sketch after this list)
  • On synthetic datasets, traditional methods such as MICE outperform LLMs, indicating that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction
  • LLMs incur significantly higher computational time and monetary costs than traditional methods
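
The abstract does not name the evaluation metrics, so the following is a hedged sketch of how imputation quality is conventionally scored on held-out (masked) cells: RMSE for numeric columns, exact-match accuracy for categorical ones.

```python
# Conventional imputation scores over masked cells; the paper's actual
# metrics are an assumption here (not stated in the abstract).
import numpy as np
import pandas as pd

def score_numeric(true_vals: pd.Series, imputed_vals: pd.Series) -> float:
    """Root mean squared error on masked numeric cells."""
    return float(np.sqrt(np.mean((true_vals - imputed_vals) ** 2)))

def score_categorical(true_vals: pd.Series, imputed_vals: pd.Series) -> float:
    """Fraction of masked categorical cells recovered exactly."""
    return float((true_vals == imputed_vals).mean())

print(score_numeric(pd.Series([3.1, 4.0]), pd.Series([3.0, 4.2])))      # ~0.158
print(score_categorical(pd.Series(["a", "b"]), pd.Series(["a", "c"])))  # 0.5
```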

Merits

Strength in Semantics-Driven Imputation

LLMs demonstrate superior performance on real-world datasets, leveraging semantic context to drive imputation quality.

Comprehensive Benchmarking Study

The study evaluates five LLMs and six state-of-the-art imputation baselines across 29 datasets under three missingness mechanisms, providing a robust assessment of their relative performance.
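
As a concrete example of the traditional side of the comparison, scikit-learn's IterativeImputer implements a MICE-inspired round-robin scheme in which each feature with missing values is modeled from the others. Whether the paper used this exact implementation or configuration is an assumption.

```python
# MICE-style baseline via scikit-learn's IterativeImputer (MICE-inspired).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)  # each column regressed on the others
print(X_imputed)
```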

Demerits

High Computational Costs

LLMs incur significantly higher computational time and monetary costs than traditional methods, limiting their practical application.
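
The cost gap is easy to see with a back-of-envelope sketch: one API call per missing cell adds up quickly on large tables. Every number below (token counts, per-token prices) is a hypothetical placeholder, not a figure from the paper.

```python
# Back-of-envelope cost of LLM imputation; all rates below are hypothetical.
PROMPT_TOKENS_PER_CELL = 150   # serialized row + instructions (assumed)
OUTPUT_TOKENS_PER_CELL = 10    # short value-only answer (assumed)
PRICE_PER_1K_INPUT = 0.0005    # USD per 1k input tokens, placeholder
PRICE_PER_1K_OUTPUT = 0.0015   # USD per 1k output tokens, placeholder

def estimated_cost(n_missing_cells: int) -> float:
    """Dollar cost if every missing cell requires one API call."""
    in_cost = n_missing_cells * PROMPT_TOKENS_PER_CELL / 1000 * PRICE_PER_1K_INPUT
    out_cost = n_missing_cells * OUTPUT_TOKENS_PER_CELL / 1000 * PRICE_PER_1K_OUTPUT
    return in_cost + out_cost

# 20% missing cells in a 100k-row, 10-column table -> 200k API calls.
print(f"${estimated_cost(200_000):,.2f}")  # vs. near-zero marginal cost for MICE
```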

Dependence on Domain-Specific Patterns

LLMs' performance is closely tied to their prior exposure to domain-specific patterns learned during pre-training, limiting their generalizability.

Expert Commentary

This study makes a significant contribution to the field of missing data imputation, shedding light on both the strengths and the limitations of Large Language Models. While LLMs demonstrate superior performance on real-world datasets, their high computational costs and their dependence on patterns absorbed during pre-training limit their practical application. The findings also matter for data governance: imputed values produced by opaque models are difficult to audit, which underscores the need for more robust and transparent imputation methods. Developing more explainable and interpretable models would further help mitigate bias and unfairness in imputed data.

Recommendations

  • Develop more explainable and interpretable LLMs that can mitigate biases and unfairness in imputed data
  • Investigate strategies to reduce computational costs and improve the generalizability of LLMs

Sources

Original: arXiv - cs.LG