PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark
arXiv:2603.14456v1 Announce Type: new Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
Executive Summary
This study introduces PARSA-Bench, the first comprehensive benchmark for evaluating large audio-language models on the Persian language and culture. The benchmark comprises 16 tasks, including speech understanding, paralinguistic analysis, and cultural audio understanding. Notably, text-only baselines outperform audio counterparts, suggesting models may not leverage audio-specific information beyond transcription. The culturally-grounded tasks expose a distinct failure mode in prosodic perception, with models performing near random chance on vazn detection. The study highlights the need for culturally-sensitive benchmarks and audio-specific models to advance language understanding.
Key Points
- ▸ PARSA-Bench is the first comprehensive benchmark for evaluating audio-language models on Persian language and culture
- ▸ Text-only baselines consistently outperform audio counterparts, suggesting little use of audio-specific information
- ▸ Culturally-grounded tasks expose a distinct failure mode in prosodic perception
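The near-chance result on vazn (poetic meter) detection can be made concrete with a small scoring sketch. This is an illustrative simulation only: the field names and meter labels below are assumptions, not the actual PARSA-Bench schema (the real data can be fetched from the Hugging Face dataset linked in the abstract).

```python
import random

def accuracy(predictions, references):
    """Fraction of exact matches between predicted and gold labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy 4-way samples standing in for a vazn-detection task; the meter
# names and dict fields are illustrative, not the dataset's schema.
METERS = ["hazaj", "ramal", "mutaqarib", "rajaz"]
samples = [{"choices": METERS, "answer": random.choice(METERS)}
           for _ in range(10_000)]

# A model guessing uniformly on a 4-way task lands near 0.25 accuracy,
# the near-chance level the paper reports for vazn detection.
preds = [random.choice(s["choices"]) for s in samples]
gold = [s["answer"] for s in samples]
chance_acc = accuracy(preds, gold)
print(round(chance_acc, 2))  # close to 0.25
```

Comparing each model's per-task accuracy against this chance floor is what distinguishes a genuine prosodic-perception failure from ordinary low performance.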
Merits
Strength
The study introduces a culturally-sensitive benchmark that addresses the unique challenges of the Persian language and culture, providing a valuable resource for advancing language understanding.
Methodological rigor
The study employs a comprehensive evaluation framework, comprising 16 tasks and over 8,000 samples, ensuring a robust assessment of audio-language models.
Demerits
Limitation
The study focuses on a single language and culture, limiting its generalizability to other languages and cultural contexts.
Technical challenge
The study highlights the technical challenge of developing audio-specific models that can leverage culturally-sensitive information, which remains an open research problem.
Expert Commentary
This study makes a significant contribution to audio-based language understanding by introducing a culturally-grounded benchmark that addresses the distinctive challenges of Persian: classical poetry, traditional music, and pervasive code-switching. Its findings on the limitations of current audio-language models, particularly that text-only baselines match or beat audio inputs and that prosodic tasks like vazn detection sit at chance level, motivate research into models that genuinely exploit audio-specific and culture-specific cues rather than relying on transcription. The methodology and evaluation framework are rigorous and comprehensive, making PARSA-Bench a valuable resource. However, the focus on a single language and culture limits generalizability, and building models that perceive prosody and cultural context from audio remains an open research problem.
Recommendations
- ✓ Future research should focus on developing models that leverage culture-specific audio and prosodic information, using benchmarks like PARSA-Bench as a starting point.
- ✓ The development of culturally-sensitive AI applications should be prioritized, particularly in regions with diverse languages and cultures.