Fine-tuning Whisper for Pashto ASR: strategies and scale
Abstract
arXiv:2604.06507v1. Pashto is absent from Whisper's pre-training corpus despite being one of CommonVoice's largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
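For context on the strategies being compared, below is a minimal sketch of how the LoRA (rank 64) setting might be configured for whisper-base with HuggingFace Transformers and PEFT. Only the rank comes from the abstract; the target modules, scaling factor, and dropout are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch of the LoRA (rank 64) strategy compared in the paper, using
# HuggingFace Transformers + PEFT. Only the rank comes from the abstract;
# target modules, alpha, and dropout below are illustrative assumptions.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

lora_config = LoraConfig(
    r=64,                                  # rank stated in the abstract
    lora_alpha=128,                        # assumed scaling factor
    lora_dropout=0.05,                     # assumed
    target_modules=["q_proj", "v_proj"],   # common choice for Whisper attention; assumed
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the injected low-rank adapters are trainable
```

Vanilla full fine-tuning, by contrast, simply updates every parameter; the abstract reports that this plainer recipe beats the LoRA variant by 33.36 pp WER on CV20.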
Executive Summary
This article critically examines the challenges and strategies for adapting OpenAI's Whisper ASR model for Pashto, a language notably absent from Whisper's original pre-training. The authors demonstrate that off-the-shelf Whisper models are unusable for Pashto, yielding over 100% WER. Through rigorous comparative analysis of fine-tuning strategies on CommonVoice Pashto, the study identifies vanilla full fine-tuning as the most effective approach, significantly outperforming LoRA, frozen-encoder, and Urdu-to-Pashto transfer. While scaling to larger Whisper models (small, large-v3-turbo) offers marginal gains, whisper-small emerges as the practical optimum given diminishing returns. The research highlights the critical role of online augmentation and identifies specific linguistic error modes, offering valuable insights for low-resource ASR development.
Key Points
- Off-the-shelf Whisper models are entirely ineffective for Pashto, exhibiting WERs exceeding 100% and incorrect script output (a minimal WER illustration follows this list).
- Vanilla full fine-tuning significantly outperforms alternative strategies (LoRA, frozen-encoder, Urdu transfer) for Pashto ASR adaptation.
- Whisper-small is identified as the practical optimum model size for Pashto ASR at 113 hours of training data, offering a balance between performance and computational cost.
- Online data augmentation provides a substantial performance benefit (7.25 pp WER reduction).
- Dominant failure modes include word-final suffix confusion and retroflex substitutions involving Pashto-unique phonemes.
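The above-100% figures follow from how WER is computed: insertions count as errors alongside substitutions and deletions, so wrong-script output longer than the reference pushes the rate past 100%. A minimal illustration with the jiwer package is below; the sentences are invented examples, not items from the CommonVoice Pashto test set.

```python
# How WER can exceed 100%: every insertion and substitution counts as an error,
# so wrong-script output longer than the reference inflates the rate past 1.0.
# The sentences below are made-up examples, not CommonVoice Pashto test items.
import jiwer

reference = "زه ښوونځي ته ځم"               # Pashto reference, 4 words
hypothesis = "انا اذهب الى المدرسة الآن"      # wrong-script (Arabic) output, 5 words

# 4 substitutions + 1 insertion over 4 reference words -> WER = 5/4 = 125%
print(f"WER = {jiwer.wer(reference, hypothesis):.2%}")
```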
Merits
Rigorous Comparative Analysis
The paper systematically compares four distinct fine-tuning strategies and three Whisper model sizes, providing robust empirical evidence for its conclusions.
Addressing a Critical Language Gap
It tackles the significant challenge of ASR for Pashto, a major language underserved by existing models, contributing directly to linguistic inclusivity in AI.
Practical and Actionable Insights
The identification of vanilla fine-tuning as the best strategy, of whisper-small as the practical optimum model size, and of the benefits of augmentation offers clear guidance for future Pashto ASR development.
Detailed Error Analysis
The specific identification of linguistic failure modes (suffix confusion, retroflex substitutions) provides valuable direction for targeted model improvements and linguistic feature engineering.
Open Science Contribution
The release of fine-tuned checkpoints and evaluation scripts enhances reproducibility and facilitates further research in the field.
Demerits
Limited Exploration of Data Augmentation Techniques
While online augmentation is highlighted, a deeper dive into specific augmentation strategies (e.g., spectral masking, speed perturbation, noise injection) and their differential impact could have provided richer insights.
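For concreteness, here is a hedged sketch of what an online waveform-augmentation step could look like; the specific transforms (naive speed perturbation and additive noise at a random SNR) are assumptions for illustration, since the abstract only reports the 7.25 pp benefit without naming the techniques used.

```python
# Hedged sketch of on-the-fly ("online") waveform augmentation for ASR
# fine-tuning. The abstract reports a 7.25 pp WER benefit from online
# augmentation but does not name the transforms; the naive speed perturbation
# and additive noise below are assumptions, not the authors' pipeline.
import numpy as np

def augment(waveform: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Speed perturbation in [0.9, 1.1] via linear resampling of the samples.
    factor = rng.uniform(0.9, 1.1)
    positions = np.arange(0, len(waveform), factor)
    waveform = np.interp(positions, np.arange(len(waveform)), waveform)

    # Additive Gaussian noise at a random SNR between 15 and 30 dB.
    snr_db = rng.uniform(15.0, 30.0)
    signal_power = np.mean(waveform ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return (waveform + noise).astype(np.float32)

# Called inside the training data loader so each epoch sees fresh perturbations
# rather than a fixed, pre-augmented corpus.
```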
Lack of Ablation Study for Frozen-Encoder Strategy
The conclusion about 'layer-function separation' not holding at 'whisper-base' depth is insightful but could be strengthened by an ablation study across different frozen layer configurations.
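Such an ablation is straightforward to set up with the HuggingFace Whisper implementation. The sketch below freezes the first k of whisper-base's 6 encoder layers (the paper's configuration freezes 2); whether the convolutional front-end is frozen alongside them is an assumption, as the abstract does not give the exact recipe.

```python
# Sketch of a frozen-encoder ablation for whisper-base (6 encoder layers),
# freezing the first k layers and leaving the rest trainable. Module paths
# follow the HuggingFace WhisperForConditionalGeneration implementation;
# freezing the conv front-end along with the layers is an assumption.
from transformers import WhisperForConditionalGeneration

def freeze_encoder_layers(model: WhisperForConditionalGeneration, k: int):
    encoder = model.model.encoder
    if k > 0:
        # Freeze the conv front-end whenever any transformer layers are frozen.
        for module in (encoder.conv1, encoder.conv2):
            for p in module.parameters():
                p.requires_grad = False
    for layer in encoder.layers[:k]:
        for p in layer.parameters():
            p.requires_grad = False
    return model

model = freeze_encoder_layers(
    WhisperForConditionalGeneration.from_pretrained("openai/whisper-base"), k=2
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction with 2/6 layers frozen: {trainable / total:.1%}")
```

Re-running fine-tuning while sweeping k from 0 to 6 would show whether the reported degradation is specific to the 2/6 configuration or appears at any freezing depth on a 6-layer encoder.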
Insufficient Detail on 'Unverified Intermediate Checkpoint' for Urdu Transfer
The failure of Urdu-to-Pashto transfer is partly attributed to an 'unverified intermediate checkpoint.' More technical detail on this specific issue would be beneficial for future research attempting similar transfer learning.
Absence of Human Evaluation
While WER is a standard metric, human evaluation of transcription quality, especially for nuanced linguistic errors, could offer a more holistic understanding of model performance and user experience.
Expert Commentary
This article represents a significant and methodologically sound contribution to the field of low-resource ASR. The authors' systematic comparison of fine-tuning strategies for Pashto is particularly commendable, offering clear empirical evidence that vanilla full fine-tuning remains a robust baseline, even against more parameter-efficient methods like LoRA. The identification of 'whisper-small' as the practical optimum is a crucial insight for resource-constrained development, balancing performance with computational feasibility. The granular error analysis, pinpointing specific phonological and morphological challenges, moves beyond mere quantitative metrics to provide actionable linguistic insights for future model improvements. While the paper could have delved deeper into the specifics of data augmentation techniques or offered a more comprehensive ablation of the frozen-encoder strategy, its strengths in rigorous comparison, practical guidance, and commitment to open science far outweigh these minor critiques. This work establishes a strong precedent for adapting large pre-trained models to linguistically diverse contexts, advancing the goal of truly inclusive AI.
Recommendations
- Conduct further research into advanced data augmentation techniques tailored to Pashto phonology and morphology to achieve further WER reductions.
- Explore multilingual fine-tuning strategies that leverage related languages (e.g., Dari, Urdu) more effectively, potentially through joint training or more sophisticated transfer learning architectures, rather than direct checkpoint transfer.
- Investigate the impact of larger and more diverse Pashto datasets, beyond CommonVoice, on the performance ceiling of Whisper models and on the practical optimum model size.
- Implement and evaluate targeted linguistic feature engineering or architectural modifications to specifically address the identified dominant failure modes, suffix confusion and retroflex substitutions (a minimal error-tally sketch follows this list).
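As a starting point for such targeted work, the sketch below tallies masculine -ay vs. feminine -a substitutions by aligning reference and hypothesis words with difflib. The word-final spellings it checks (ی vs. ه) and the equal-stem heuristic are illustrative assumptions, not the paper's error-analysis procedure.

```python
# Hedged sketch of a targeted tally for the reported word-final suffix
# confusion (masculine -ay vs. feminine -a). Words are aligned with difflib;
# the suffix spellings checked here (ی vs. ه) and the equal-stem heuristic are
# illustrative assumptions, not the authors' error-analysis procedure.
from difflib import SequenceMatcher

MASC_SUFFIX, FEM_SUFFIX = "ی", "ه"   # assumed word-final spellings of -ay and -a

def count_suffix_confusions(reference: str, hypothesis: str) -> int:
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    confusions = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        for ref_w, hyp_w in zip(ref_words[i1:i2], hyp_words[j1:j2]):
            same_stem = ref_w[:-1] == hyp_w[:-1]
            suffix_swap = {ref_w[-1], hyp_w[-1]} == {MASC_SUFFIX, FEM_SUFFIX}
            if same_stem and suffix_swap:
                confusions += 1
    return confusions
```

Aggregated over the test set, the count gives a single number to track across checkpoints when evaluating fixes for this failure mode.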
Sources
Original: arXiv - cs.CL