VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
arXiv:2603.08936v1 (cross-listed)
Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
Executive Summary
The article presents VoxEmo, a benchmark for speech emotion recognition (SER) with speech large language models (LLMs). It targets two problems the authors identify: generative, zero-shot evaluation is highly sensitive to prompt wording, and conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. VoxEmo covers 35 emotion corpora across 15 languages and ships a standardized toolkit whose prompts range from direct classification to paralinguistic reasoning. The authors add a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. In their experiments, zero-shot speech LLMs trail supervised baselines in hard-label accuracy but align with human subjective distributions. The study points to the potential of speech LLMs for SER and argues for evaluation that goes beyond a single correct label.
Key Points
- ▸ VoxEmo is a comprehensive benchmark for SER using speech LLMs
- ▸ The benchmark features 35 emotion corpora across 15 languages and a standardized toolkit with varying prompt complexities
- ▸ The authors introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy to reflect real-world perception and annotator disagreement
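The soft-label and prompt-ensemble ideas in the last point can be illustrated with a minimal Python sketch. It is not the VoxEmo toolkit: the label set, the `to_distribution` helper, the number of prompt variants, and the vote data are all hypothetical, and it assumes each prompt variant's free-text answer has already been parsed into one closed-set label.

```python
from collections import Counter

# Hypothetical closed label set; the benchmark's actual label inventory is not given here.
LABELS = ["angry", "happy", "neutral", "sad"]


def to_distribution(votes, labels=LABELS):
    """Turn a list of label votes into a probability distribution over `labels`."""
    # Votes outside the label set (e.g. unparseable model output) are ignored.
    counts = Counter(v for v in votes if v in labels)
    total = sum(counts.values())
    if total == 0:
        # No usable votes: fall back to a uniform distribution.
        return [1.0 / len(labels)] * len(labels)
    return [counts[label] / total for label in labels]


# Five human annotators disagree on one utterance (made-up data).
annotator_votes = ["sad", "sad", "neutral", "sad", "angry"]
# One parsed model answer per prompt variant (four hypothetical prompts).
ensemble_votes = ["sad", "neutral", "sad", "sad"]

human_soft = to_distribution(annotator_votes)  # soft label from annotators
model_soft = to_distribution(ensemble_votes)   # ensemble-derived prediction

print(dict(zip(LABELS, human_soft)))
print(dict(zip(LABELS, model_soft)))
```

Treating each prompt variant as one "annotator" is only one plausible reading of a strategy that emulates annotator disagreement; the paper may weight or sample prompts differently.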
Merits
Strength in Methodology
The study's comprehensive approach, incorporating multiple emotion corpora and varying prompt complexities, provides a robust evaluation framework for SER using speech LLMs.
Insight into Human Subjective Distributions
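The study reports that zero-shot speech LLMs, although behind supervised baselines in hard-label accuracy, align with human subjective distributions. This suggests that hard-label accuracy alone understates what these models capture about ambiguous emotional speech.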
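To make the contrast between hard-label accuracy and distributional alignment concrete, here is a small Python sketch. The summary does not name the paper's alignment metric, so Jensen-Shannon divergence is used purely as an assumed stand-in, and the three distributions are invented toy data.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hard_label_correct(pred, ref):
    """True when both distributions share the same top-scoring index."""
    return pred.index(max(pred)) == ref.index(max(ref))


# Toy distributions over (angry, happy, neutral, sad); not taken from the paper.
human = [0.2, 0.0, 0.2, 0.6]    # annotators mostly hear "sad", with some disagreement
model_a = [0.0, 0.0, 0.0, 1.0]  # fully confident in the majority label
model_b = [0.2, 0.0, 0.5, 0.3]  # wrong top label, but mass spread more like the annotators

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print(name, hard_label_correct(pred, human), round(js_divergence(pred, human), 4))
```

In this toy case `model_b` misses the majority label yet sits closer to the human distribution than `model_a`, which is the kind of gap a hard-label metric cannot show.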
Demerits
Limited Generalizability
The study's focus on a specific benchmark and evaluation protocol may limit its generalizability to other SER applications and evaluation frameworks.
Need for Further Investigation
The study's results, while promising, should be further investigated and validated in real-world scenarios to fully understand the potential of speech LLMs for SER.
Expert Commentary
The article makes a substantial contribution to SER evaluation. Its central result is mixed: zero-shot speech LLMs do not beat supervised baselines in hard-label accuracy, but their outputs track the spread of human judgments. That alignment matters for human-centered interfaces, where emotions are often ambiguous and a single "correct" label is an oversimplification. The findings still need validation in real-world scenarios before they can guide the development and deployment of SER systems.
Recommendations
- ✓ Future studies should investigate the generalizability of VoxEmo to other SER applications and evaluation frameworks.
- ✓ Developers deploying SER systems in real-world applications should evaluate alignment with human subjective distributions alongside hard-label accuracy, since a single label can misrepresent ambiguous emotional speech.