PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark
arXiv:2603.14456v1 Announce Type: new Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
Executive Summary
This study introduces PARSA-Bench, the first comprehensive benchmark for evaluating large audio-language models on the Persian language and culture. The benchmark comprises 16 tasks, including speech understanding, paralinguistic analysis, and cultural audio understanding. Notably, text-only baselines outperform audio counterparts, suggesting models may not leverage audio-specific information beyond transcription. The culturally-grounded tasks expose a distinct failure mode in prosodic perception, with models performing near random chance on vazn detection. The study highlights the need for culturally-sensitive benchmarks and audio-specific models to advance language understanding.
Key Points
- ▸ PARSA-Bench is the first comprehensive benchmark for evaluating audio-language models on Persian language and culture
- ▸ Text-only baselines consistently outperform audio counterparts, suggesting little use of audio-specific information
- ▸ Culturally-grounded tasks expose a distinct failure mode in prosodic perception
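The near-chance result on vazn (poetic meter) detection can be made concrete with a small scoring sketch. This is an illustrative simulation only: the field names and meter labels below are assumptions, not the actual PARSA-Bench schema (the real data can be fetched from the Hugging Face dataset linked in the abstract).

```python
import random

def accuracy(predictions, references):
    """Fraction of exact matches between predicted and gold labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy 4-way samples standing in for a vazn-detection task; the meter
# names and dict fields are illustrative, not the dataset's schema.
METERS = ["hazaj", "ramal", "mutaqarib", "rajaz"]
samples = [{"choices": METERS, "answer": random.choice(METERS)}
           for _ in range(10_000)]

# A model guessing uniformly on a 4-way task lands near 0.25 accuracy,
# the near-chance level the paper reports for vazn detection.
preds = [random.choice(s["choices"]) for s in samples]
gold = [s["answer"] for s in samples]
chance_acc = accuracy(preds, gold)
print(round(chance_acc, 2))  # close to 0.25
```

Comparing each model's per-task accuracy against this chance floor is what distinguishes a genuine prosodic-perception failure from ordinary low performance.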
Merits
Strength
The study introduces a culturally-sensitive benchmark that addresses the unique challenges of the Persian language and culture, providing a valuable resource for advancing language understanding.
Methodological rigor
The study employs a comprehensive evaluation framework, comprising 16 tasks and over 8,000 samples, ensuring a robust assessment of audio-language models.
Demerits
Limitation
The study focuses on a single language and culture, limiting its generalizability to other languages and cultural contexts.
Technical challenge
The study highlights the technical challenge of developing audio-specific models that can leverage culturally-sensitive information, which remains an open research problem.
Expert Commentary
This study makes a significant contribution to audio-based language understanding by introducing a culturally-grounded benchmark that addresses the distinctive challenges of Persian: classical poetry, traditional music, and pervasive code-switching. Its findings on the limitations of current audio-language models, particularly that text-only baselines match or beat audio inputs and that prosodic tasks like vazn detection sit at chance level, motivate research into models that genuinely exploit audio-specific and culture-specific cues rather than relying on transcription. The methodology and evaluation framework are rigorous and comprehensive, making PARSA-Bench a valuable resource. However, the focus on a single language and culture limits generalizability, and building models that perceive prosody and cultural context from audio remains an open research problem.
Recommendations
- ✓ Future research should focus on developing models that leverage culture-specific audio and prosodic information, using benchmarks like PARSA-Bench as a starting point.
- ✓ The development of culturally-sensitive AI applications should be prioritized, particularly in regions with diverse languages and cultures.