
Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar

arXiv:2603.06505v1

Abstract: Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.

Executive Summary

This article presents a context-aware framework for multilingual automatic speech recognition (ASR) that supports diverse languages and accents while preserving the modularity of pretrained models. The method couples a frozen speech encoder with a decoder-only language model through a lightweight projection module, allowing structured context prompts to guide transcription. A contrastive learning objective aligns speech and context representations in a shared embedding space, yielding improved recognition quality and an overall performance gain of over 5%. The approach is evaluated on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects, demonstrating the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
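The paper does not publish the projection module's exact architecture here, but the idea of bridging a frozen speech encoder and a decoder-only LM can be sketched as a small MLP that maps encoder features into the LM's embedding space. The dimensions, layer count, and class name below are illustrative assumptions, not the paper's actual configuration; NumPy stands in for a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class SpeechProjector:
    """Hypothetical two-layer MLP mapping frozen speech-encoder features
    into the decoder-only LM's embedding space (dimensions are assumed)."""

    def __init__(self, speech_dim=1024, lm_dim=2048):
        scale = 0.02  # small random init, as is common for adapter layers
        self.w1 = rng.standard_normal((speech_dim, lm_dim)) * scale
        self.b1 = np.zeros(lm_dim)
        self.w2 = rng.standard_normal((lm_dim, lm_dim)) * scale
        self.b2 = np.zeros(lm_dim)

    def __call__(self, feats):
        # feats: (batch, time, speech_dim) output of the frozen encoder;
        # returns (batch, time, lm_dim) pseudo-token embeddings for the LM
        return gelu(feats @ self.w1 + self.b1) @ self.w2 + self.b2

proj = SpeechProjector()
feats = rng.standard_normal((2, 50, 1024))  # dummy frozen-encoder output
lm_inputs = proj(feats)
print(lm_inputs.shape)  # (2, 50, 2048)
```

Because only this projector would be trained, the speech encoder and LM stay frozen, which is what preserves the modularity the paper emphasizes.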

Key Points

  • Introduction of a context-aware multilingual ASR framework supporting diverse languages and accents
  • Employment of a contrastive learning objective to align speech and context representations
  • Preservation of modularity in pretrained models through a lightweight projection module
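The contrastive objective in the bullet above is commonly instantiated as an InfoNCE-style loss, where matched (speech, context) pairs in a batch are positives and all other pairings are negatives. The sketch below assumes that formulation; the paper's exact loss and temperature are not specified here.

```python
import numpy as np

def info_nce(speech_emb, context_emb, temperature=0.07):
    """InfoNCE-style alignment loss: row i of speech_emb and row i of
    context_emb form the positive pair; other rows serve as negatives.
    (The temperature value is an assumption, not the paper's setting.)"""
    # Normalize so dot products are cosine similarities
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    logits = (s @ c.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the positives on the diagonal
    return -float(np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 16))
aligned = info_nce(emb, emb)              # matched pairs: loss near zero
mismatched = info_nce(emb, emb[::-1].copy())  # shuffled pairs: loss large
print(aligned, mismatched)
```

Minimizing this loss pulls each utterance's speech embedding toward the embedding of its own context and pushes it away from the other contexts in the batch, which is the shared-space alignment the paper describes.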

Merits

Improved recognition quality through contextual input

The study demonstrates that contextual input consistently improves recognition quality, highlighting the importance of contextual modeling in multilingual ASR.
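The contextual input takes the form of structured prompts carrying dialogue history and biasing words. The paper's actual prompt template is not given here, so the tag names and layout below are purely illustrative of how such a prompt might be assembled.

```python
def build_context_prompt(dialogue_history, biasing_words, language):
    """Assemble a structured context prompt for the decoder-only LM.
    The tag format is a hypothetical illustration, not the paper's template."""
    parts = [f"<lang>{language}</lang>"]
    if dialogue_history:
        # Previous turns give the model conversational context
        parts.append("<history>" + " | ".join(dialogue_history) + "</history>")
    if biasing_words:
        # Rare or domain-specific terms the model should favor
        parts.append("<bias>" + ", ".join(biasing_words) + "</bias>")
    return "\n".join(parts)

prompt = build_context_prompt(
    dialogue_history=["Where are you flying from?", "From Zurich."],
    biasing_words=["Kloten", "boarding pass"],
    language="en-GB",
)
print(prompt)
```

Conditioning the decoder on such a prompt is what lets dialogue history disambiguate homophones and biasing words rescue rare named entities.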

Enhanced performance through contrastive alignment

The contrastive learning objective provides additional gains when applied to different context types, resulting in an overall performance gain of over 5%.

Demerits

Limited evaluation on a single task

The evaluation covers only the conversational ASR task, so it remains unclear whether the proposed approach would generalize to other tasks or domains.

Dependence on large amounts of training data

The study's success relies on large amounts of training data, over 1,500 hours of conversational speech, which may not be available in all scenarios, particularly for low-resource languages.

Expert Commentary

The article makes a significant contribution to multilingual ASR by introducing a context-aware framework that supports diverse languages and accents. The approach improves recognition quality and achieves an overall performance gain of over 5%, underscoring the value of contextual modeling and cross-modal alignment. However, the evaluation is limited to a single task, so it is unclear whether the approach generalizes to other tasks or domains; future work should probe its robustness and adaptability across a wider range of scenarios.

Recommendations

  • Future research should explore the application of the proposed approach to other tasks and domains to evaluate its generalizability.
  • Developers of ASR systems should treat contextual modeling and cross-modal alignment as core design considerations when building more effective and inclusive systems.
