
Temporal Text Classification with Large Language Models

arXiv:2603.11295v1 Abstract: Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models, but that these still fail to match the performance delivered by proprietary LLMs.

Nishat Raihan, Marcos Zampieri


Executive Summary

This study pioneers a systematic evaluation of Temporal Text Classification (TTC) using leading proprietary and open-source Large Language Models (LLMs). For the first time, researchers assessed proprietary models (Claude 3.5, GPT-4o, Gemini 1.5) and open-source variants (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) on TTC across three historical corpora—two in English and one in Portuguese. The evaluation incorporated zero-shot, few-shot, and fine-tuning configurations. Findings reveal that proprietary LLMs outperform open-source models, particularly when few-shot prompting is employed, with fine-tuning significantly enhancing open-source models but still falling short of proprietary performance. This work fills a critical gap in understanding LLM capabilities in temporal analysis and informs future research and application development in historical text dating.
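The three evaluation settings named above (zero-shot, few-shot, fine-tuning) differ mainly in what the model sees before the target passage. The following minimal sketch illustrates the first two for a dating task; the prompt wording, decade labels, and example passages are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of zero-shot vs. few-shot prompting for Temporal Text
# Classification (TTC). Prompt wording and labels are assumptions.

def build_ttc_prompt(text, examples=None):
    """Build a dating prompt; passing `examples` (a list of
    (passage, decade) pairs) turns zero-shot into few-shot."""
    lines = ["Estimate the decade in which the following passage was "
             "published. Answer with a single decade, e.g. 1890s."]
    # Few-shot: labelled period examples precede the target passage.
    for passage, decade in (examples or []):
        lines.append(f"\nPassage: {passage}\nDecade: {decade}")
    # The target passage always comes last, with the label left open.
    lines.append(f"\nPassage: {text}\nDecade:")
    return "\n".join(lines)

# Zero-shot: only the instruction and the target passage.
zero_shot = build_ttc_prompt("Whilst the carriage awaited, she wrote...")

# Few-shot: two labelled examples, then the target passage.
few_shot = build_ttc_prompt(
    "Whilst the carriage awaited, she wrote...",
    examples=[("The telegraph office closed at dusk.", "1870s"),
              ("She streamed the lecture on her phone.", "2010s")],
)
```

Fine-tuning, the third setting, instead updates model weights on labelled (passage, date) pairs and needs no in-prompt examples at inference time.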

Key Points

  • First systematic evaluation of TTC using LLMs
  • Proprietary models outperform open-source models in TTC
  • Fine-tuning improves open-source models but not to proprietary levels

Merits

Innovation

Pioneers empirical validation of TTC with LLMs, establishing a benchmark for future studies in temporal text analysis.

Comparative Analysis

Provides a nuanced comparison between proprietary and open-source LLMs, offering insights into relative strengths under varying prompting strategies.
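Comparing models on TTC requires a common scoring scheme. The summary does not state the paper's exact metrics, so the sketch below assumes decade-level labels and uses two common choices: exact-decade accuracy and mean absolute error (MAE) in years.

```python
# Illustrative scoring of decade-level TTC predictions. Decade accuracy
# and MAE in years are assumed metrics, not the paper's stated ones.

def decade_to_year(label):
    """Map a decade label such as '1890s' to its start year, 1890."""
    return int(label.rstrip("s"))

def score_ttc(predictions, gold):
    """Return (exact-decade accuracy, MAE in years) over paired labels."""
    hits = sum(p == g for p, g in zip(predictions, gold))
    mae = sum(abs(decade_to_year(p) - decade_to_year(g))
              for p, g in zip(predictions, gold)) / len(gold)
    return hits / len(gold), mae

# One prediction is off by one decade (1900s vs. 1910s).
acc, mae = score_ttc(["1890s", "1900s", "1950s"],
                     ["1890s", "1910s", "1950s"])
# acc = 2/3, mae = (0 + 10 + 0) / 3 ≈ 3.33 years
```

MAE rewards near-misses that accuracy ignores, which matters when comparing models whose errors differ in magnitude rather than frequency.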

Demerits

Scope Limitation

With only three corpora in two languages (English and Portuguese), the findings may not generalize to broader linguistic or temporal contexts.

Performance Gap

While fine-tuning enhances open-source models, the persistent performance gap with proprietary models may deter adoption in resource-constrained environments.

Expert Commentary

This paper represents a significant advancement in the intersection of temporal analysis and LLMs. The methodological rigor—evaluating zero-shot, few-shot, and fine-tuning across multiple languages—demonstrates a sophisticated understanding of both the technical and practical challenges. The results are particularly compelling because they validate a widely held intuition: proprietary models, due to their superior training data and architectural refinement, excel in domain-specific tasks like temporal classification. However, the study also quietly acknowledges a broader concern: the sustainability of open-source AI in specialized applications. While fine-tuning offers a path forward for open-source models, the persistent performance differential may necessitate a reevaluation of open-source investment models. Moreover, the implications extend beyond academia: industries relying on temporal text analysis—such as legal, archival, or media sectors—will need to adapt procurement strategies to align with empirical performance data. Ultimately, this work bridges a critical gap and sets a new standard for evaluating LLMs in temporal tasks.

Recommendations

  • Future research should expand corpus diversity and include multilingual, longitudinal datasets to validate findings beyond current scope.
  • Open-source communities should prioritize targeted fine-tuning frameworks and augment dataset quality to bridge performance gaps with proprietary models.
