BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Ilias Aarab

arXiv:2603.11991v1

Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
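
The core idea the abstract describes, matching a text directly to human-readable label descriptions, can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's method: a bag-of-words count vector stands in for a real encoder (such as the GTE-large-en-v1.5 embedding model evaluated in the paper), so it runs without any model download.

```python
# Toy sketch of zero-shot classification by embedding similarity.
# A word-count vector stands in for a trained text encoder; real systems
# would replace embed() with a dense embedding model.
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a word-count vector (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(text, label_descriptions):
    """Pick the label whose description is most similar to the text."""
    doc = embed(text)
    return max(label_descriptions,
               key=lambda label: cosine(doc, embed(label_descriptions[label])))

labels = {
    "positive": "this text expresses a positive sentiment",
    "negative": "this text expresses a negative sentiment",
}
print(zero_shot_classify("what a great and positive movie", labels))  # → positive
```

Because no labeled example is ever seen, the whole "training signal" lives in the wording of the label descriptions, which is why benchmarks like BTZSC that fix those descriptions are needed for fair comparison.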

Executive Summary

This article introduces BTZSC, a benchmark of 22 public datasets for zero-shot text classification spanning sentiment, topic, intent, and emotion tasks. The authors systematically compare four model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs) across 38 public and custom checkpoints. Modern rerankers such as Qwen3-Reranker-8B set a new state of the art (macro F1 = 0.72), while strong embedding models such as GTE-large-en-v1.5 offer the best trade-off between accuracy and latency. Instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling on topic classification but trailing specialized rerankers, and NLI cross-encoders plateau even as backbone size increases; scaling primarily benefits rerankers and LLMs rather than embedding models. The benchmark and evaluation code are released publicly to support fair and reproducible progress in zero-shot text understanding.

Key Points

  • Introduction of BTZSC, a benchmark of 22 public datasets for zero-shot text classification
  • Systematic comparison across four major model families and 38 checkpoints
  • Modern rerankers (Qwen3-Reranker-8B) set a new state of the art (macro F1 = 0.72)
  • Embedding models offer the best trade-off between accuracy and latency
  • Instruction-tuned LLMs are competitive (macro F1 up to 0.67) but trail specialized rerankers
  • NLI cross-encoders plateau with backbone size; scaling mainly benefits rerankers and LLMs

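The headline results above are reported as macro-averaged F1. As a reminder of what that metric measures, here is a minimal, self-contained sketch: compute F1 per class, then take an unweighted mean over classes, so rare classes weigh as much as frequent ones.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1, then an unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]), 3))  # → 0.733
```

The unweighted mean is what makes macro F1 appropriate for BTZSC's datasets with diverse class cardinalities: a model cannot score well by only getting the majority class right.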
Merits

Comprehensive benchmark

The introduction of BTZSC provides a comprehensive benchmark for zero-shot text classification, enabling systematic comparison across diverse models and datasets.

Systematic comparison

The authors conduct a systematic comparison across four major model families, providing valuable insights into the strengths and limitations of each model type.

Results demonstrate state-of-the-art performance

The results show that modern rerankers set a new state-of-the-art, highlighting the potential of these models in zero-shot text understanding.

Demerits

Limited scope

The study focuses on zero-shot text classification and may not be directly applicable to other NLP tasks.

Dependence on specific datasets

The performance of models may vary depending on the specific datasets used, which may limit the generalizability of the results.

Expert Commentary

This study makes a valuable contribution to natural language processing by introducing a benchmark for genuinely zero-shot text classification, evaluated without the supervised probes or fine-tuning found in existing suites such as MTEB. The results demonstrate that modern rerankers lead on accuracy while embedding models offer the best accuracy-latency trade-off. They also expose a clear limitation of NLI cross-encoders, whose performance plateaus even as backbone size increases, and show that instruction-tuned LLMs, although strong on topic classification, still trail specialized rerankers. To further advance this area, future work should develop models and evaluation protocols that handle the complexities of real-world text data, and should continue to examine the fairness and transparency of zero-shot classifiers before they are deployed in practical applications.

Recommendations

  • Future studies should focus on developing more effective models for zero-shot text classification that can handle the complexities of real-world text data.
  • Researchers should prioritize the development of evaluation metrics and benchmarks that can accurately assess the performance of AI models in zero-shot text classification.

Sources