VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs

arXiv:2603.08936v1 (cross-listed). Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective d

Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain · March 11, 2026

#cs.SD #cs.AI #cs.CL #cs.MM #eess.AS

Executive Summary

The article presents VoxEmo, a comprehensive benchmark for speech emotion recognition (SER) using speech large language models (LLMs). VoxEmo addresses the limitations of conventional SER benchmarks by incorporating 35 emotion corpora across 15 languages and a standardized toolkit featuring varying prompt complexities. The authors introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy to reflect real-world perception and annotator disagreement. Experiments demonstrate that zero-shot speech LLMs trail supervised baselines in hard-label accuracy but align with human subjective distributions. This study highlights the potential of speech LLMs for SER and emphasizes the need for more nuanced evaluation metrics.

Key Points

▸ VoxEmo is a comprehensive benchmark for SER using speech LLMs
▸ The benchmark features 35 emotion corpora across 15 languages and a standardized toolkit with varying prompt complexities
▸ The authors introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy to reflect real-world perception and annotator disagreement
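
The soft-label protocol and prompt-ensemble strategy in the last point can be illustrated with a minimal sketch. It assumes, since the summary gives no details, that a soft label is the normalized histogram of annotator votes and that the ensemble prediction is the normalized histogram of the labels a model returns under different prompts. The label set, the function name `to_distribution`, and the example data are hypothetical and are not taken from the VoxEmo toolkit.

```python
from collections import Counter

# Hypothetical closed label set; the benchmark's actual label sets vary by corpus.
LABELS = ["angry", "happy", "neutral", "sad"]


def to_distribution(votes, labels=LABELS):
    """Turn a list of categorical votes into a probability distribution over labels."""
    counts = Counter(v for v in votes if v in labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid votes")
    return [counts[label] / total for label in labels]


# Five human annotators disagree on one utterance (made-up data).
annotator_votes = ["sad", "sad", "neutral", "sad", "angry"]
# One model answer per prompt variant for the same utterance (made-up data).
prompt_outputs = ["sad", "neutral", "sad", "sad"]

human_soft_label = to_distribution(annotator_votes)
model_distribution = to_distribution(prompt_outputs)

print(dict(zip(LABELS, human_soft_label)))
print(dict(zip(LABELS, model_distribution)))
```

Treating each prompt variant like one annotator keeps both sides of the comparison in the same form, a distribution over labels, so disagreement among prompts is retained as information rather than collapsed into a single majority answer.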

Merits

Strength in Methodology

The study's comprehensive approach, incorporating multiple emotion corpora and varying prompt complexities, provides a robust evaluation framework for SER using speech LLMs.

Insight into Human Subjective Distributions

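The study's findings suggest that zero-shot speech LLMs, although they trail supervised baselines in hard-label accuracy, produce outputs that align with the distribution of human judgments. This matters for SER because annotators often disagree about the emotion expressed in an utterance.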
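
To make the contrast between hard-label accuracy and distribution alignment concrete, the sketch below scores two predictions both ways. Jensen-Shannon divergence is used here only as one plausible alignment measure; the summary does not state which metric VoxEmo uses, and all distributions are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, at most 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Distributions over (angry, happy, neutral, sad); all numbers are illustrative.
human = [0.1, 0.0, 0.4, 0.5]         # annotators are split between neutral and sad
model_soft = [0.1, 0.0, 0.5, 0.4]    # top label differs, but the shape is close
model_onehot = [0.0, 0.0, 0.0, 1.0]  # top label matches, but ignores the ambiguity

hard_match_soft = argmax(model_soft) == argmax(human)
hard_match_onehot = argmax(model_onehot) == argmax(human)

print(hard_match_soft, js_divergence(human, model_soft))
print(hard_match_onehot, js_divergence(human, model_onehot))
```

In this toy case the one-hot prediction counts as correct under hard-label accuracy while diverging more from the human distribution, and the soft prediction counts as wrong while diverging less. That is the kind of gap a distribution-aware protocol is meant to expose.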

Demerits

Limited Generalizability

The study's focus on a specific benchmark and evaluation protocol may limit its generalizability to other SER applications and evaluation frameworks.

Need for Further Investigation

The study's results, while promising, should be further investigated and validated in real-world scenarios to fully understand the potential of speech LLMs for SER.

Expert Commentary

The article makes a useful contribution to SER by showing that speech LLMs can capture the spread of human emotion judgments, even though zero-shot models still trail supervised baselines in hard-label accuracy. The results come from a single benchmark and evaluation protocol, so they should be validated in real-world scenarios before conclusions are drawn about deployment. The emphasis on alignment with human subjective distributions is relevant to the design of more human-centered interfaces, because it treats annotator disagreement as signal rather than noise.

Recommendations

✓ Future studies should investigate the generalizability of VoxEmo to other SER applications and evaluation frameworks.
✓ SER systems developed for real-world applications should be evaluated on alignment with human subjective distributions, not only on hard-label accuracy.

Sources

arXiv - cs.CL

VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
