Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics

arXiv:2603.22709v1

Abstract: Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.

Executive Summary

This article presents a rigorous comparative analysis of LLM-based and modular pipeline approaches to conversational ASR under overlapping speech, far-field noise, and variable speaker counts. Using the newly introduced tcpSemER metric and a decomposed tcpWER, the study quantifies semantic fidelity and overlap robustness across three datasets. The findings show that LLM-based systems perform competitively in two-speaker scenarios but degrade markedly as overlap and speaker count increase, whereas modular pipelines exhibit greater resilience. The paper's methodological innovation, embedding-based semantic similarity as a proxy for meaning alteration, adds substantive value to the ASR evaluation literature.
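The semantic-scoring idea behind tcpSemER can be illustrated with a short sketch. The paper's implementation is not reproduced here, so the embedding model, the cosine-to-error mapping, and the function name below are illustrative assumptions; in particular, tcpSemER's time-constrained speaker-permutation alignment (inherited from tcpWER) is omitted, and only the step that replaces Levenshtein distance with embedding similarity is shown.

```python
# Minimal sketch of embedding-based semantic error scoring, assuming
# reference/hypothesis segments have already been aligned. The model
# choice and the 1 - cosine mapping are assumptions for illustration,
# not the authors' exact tcpSemER formulation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_error(reference: str, hypothesis: str) -> float:
    """Return a semantic error score: 0 = same meaning, 1 = unrelated."""
    ref_vec, hyp_vec = model.encode([reference, hypothesis])
    cos = float(np.dot(ref_vec, hyp_vec)
                / (np.linalg.norm(ref_vec) * np.linalg.norm(hyp_vec)))
    # Cosine similarity lies in [-1, 1]; clip so the error stays in [0, 1].
    return float(np.clip(1.0 - cos, 0.0, 1.0))

if __name__ == "__main__":
    # A meaning-preserving substitution should score much lower than a
    # meaning-altering one, even though both are single-word WER errors.
    print(semantic_error("turn the heating up", "turn the heat up"))      # small
    print(semantic_error("turn the heating up", "turn the heating off"))  # larger
```

This is the kind of distinction the abstract highlights: conventional WER charges both substitutions equally, while an embedding-based score separates harmless paraphrase from meaning alteration.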

Key Points

  • LLM-based systems are effective in two-speaker settings but degrade as overlap and speaker count increase
  • tcpSemER introduces semantic-aware evaluation beyond conventional WER
  • Modular pipelines demonstrate greater robustness in multi-speaker, overlapping environments

Merits

Methodological Innovation

The introduction of tcpSemER and decomposition of tcpWER into overlapping/non-overlapping components provides a more nuanced, meaning-aware assessment of ASR accuracy.
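To make the decomposition concrete, the sketch below buckets reference utterances by whether another speaker is active during their time span and scores each bucket separately. The Segment structure, the assumption that hypothesis text is already matched per segment, and the use of jiwer as a stand-in word-error scorer are all illustrative; the paper's time-constrained permutation matching is not re-implemented here.

```python
# Hedged sketch of splitting word error into overlapping vs.
# non-overlapping components. Segment matching is assumed done upstream
# (as tcpWER's time-constrained alignment would provide).
from dataclasses import dataclass
import jiwer  # stand-in word-error scorer

@dataclass
class Segment:
    speaker: str
    start: float      # seconds
    end: float        # seconds
    reference: str
    hypothesis: str   # hypothesis text already matched to this segment

def overlaps(a: Segment, b: Segment) -> bool:
    """True if a different speaker is active during this segment's span."""
    return a.speaker != b.speaker and a.start < b.end and b.start < a.end

def decomposed_wer(segments: list[Segment]) -> dict[str, float]:
    buckets: dict[str, list[Segment]] = {"overlapping": [], "non_overlapping": []}
    for seg in segments:
        key = ("overlapping"
               if any(overlaps(seg, other) for other in segments if other is not seg)
               else "non_overlapping")
        buckets[key].append(seg)
    # Aggregate WER per bucket; empty buckets are skipped.
    return {
        name: jiwer.wer([s.reference for s in segs], [s.hypothesis for s in segs])
        for name, segs in buckets.items() if segs
    }
```

Reporting the two components side by side makes it visible when a system's headline WER is carried by easy non-overlapped regions while overlapped speech fails, which is exactly the failure mode the paper attributes to LLM-based systems.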

Demerits

Limited Scope

Experiments are confined to three datasets; generalizability to broader real-world conditions (e.g., diverse acoustic environments, non-English languages) remains unaddressed.

Expert Commentary

The study represents a significant advance in the evaluation of conversational ASR, particularly in its ability to quantify semantic degradation through embedding-based metrics. The distinction between LLM-based and modular pipeline performance under varying overlap and speaker conditions is not merely quantitative; it is qualitatively meaningful. LLM-based systems, while powerful in controlled, single-speaker contexts, appear to inherit limitations from their training corpora that surface under real-world acoustic complexity. Modular pipelines, though less flexible, offer a more predictable, stable architecture for critical applications where accuracy under uncertainty is paramount. The paper's contribution lies not only in the metrics but in the conceptual shift toward evaluating ASR systems by their resilience to semantic and acoustic perturbations, a shift that matters as ASR moves from the lab bench to field deployment. Future work should extend this framework to cross-lingual, multi-modal, and low-resource settings to validate scalability.

Recommendations

  1. Adopt tcpSemER or similar semantic-aware metrics as standard evaluation tools in ASR benchmarking for multi-speaker scenarios.
  2. Encourage industry and academic consortia to develop standardized datasets with controlled overlap, speaker count, and channel variability to enable reproducible comparisons between pipeline architectures.

Sources

Original: arXiv - cs.CL