Do What I Say: A Spoken Prompt Dataset for Instruction-Following

arXiv:2603.09881v1 Abstract: Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact through speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.

Executive Summary

This article introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts for evaluating Speech Large Language Models (SLLMs) under realistic spoken-instruction conditions. The dataset spans 9 tasks and 11 languages, provides 10 prompt variants per task-language pair across five styles, and is designed to pair with any existing benchmark. Benchmarking state-of-the-art SLLMs with DOWIS shows that text prompts consistently outperform spoken prompts, with the gap widest in low-resource and cross-lingual settings; only tasks with speech output see spoken prompts close the gap. The findings argue for incorporating speech-based prompting into SLLM evaluation, a point that matters as these models are increasingly deployed in real-world applications.
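
To make the pairing mechanism concrete, below is a minimal Python sketch of how spoken or written prompt variants could be attached to items from an existing benchmark. The PromptVariant schema, field names, and task/language codes are illustrative assumptions, not the released DOWIS format.

    # Hypothetical sketch of pairing DOWIS-style prompt variants with an
    # existing benchmark. Schema and codes are assumptions for illustration.
    import random
    from dataclasses import dataclass

    @dataclass
    class PromptVariant:
        task: str        # one of the 9 tasks, e.g. "qa" (placeholder code)
        language: str    # one of the 11 languages, e.g. "en" (placeholder code)
        style: str       # one of the five prompt styles
        text: str        # written form of the instruction
        audio_path: str  # path to the human recording of the same instruction

    def pair_with_benchmark(variants, benchmark_items, task, language,
                            modality="speech", seed=0):
        """Attach a DOWIS-style prompt (spoken or written) to each benchmark
        item for the given task-language pair, sampling from the 10 variants."""
        rng = random.Random(seed)
        pool = [v for v in variants if v.task == task and v.language == language]
        for item in benchmark_items:
            v = rng.choice(pool)
            prompt = v.audio_path if modality == "speech" else v.text
            yield {"prompt": prompt, "modality": modality,
                   "style": v.style, "benchmark_input": item}

Running the same benchmark twice, once with modality="text" and once with modality="speech", isolates the modality gap the study reports.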

Key Points

  • DoWhatISay (DOWIS) is a multilingual dataset of human-recorded spoken and written prompts for evaluating SLLMs.
  • The dataset spans 9 tasks and 11 languages, with 10 prompt variants per task-language pair across five styles (9 × 11 × 10 = 990 prompts per modality).
  • Text prompts consistently outperform spoken prompts in SLLM evaluation; only tasks with speech output see spoken prompts close the gap.

Merits

Strength of Real-World Relevance

By pairing human-recorded spoken prompts with existing benchmarks, DOWIS closes the gap between text-based evaluation and the spoken interaction SLLMs encounter in real-world use.

Comprehensive Dataset

DOWIS spans 9 tasks, 11 languages, and five prompt styles, supporting evaluation of SLLMs across a wide range of conditions.

Insights into SLLM Evaluation

The results show that the text-speech performance gap is widest in low-resource and cross-lingual settings, underscoring the need to include speech-based prompting when evaluating SLLMs.
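
As a concrete illustration of this analysis, the sketch below compares per-setting accuracy under text versus spoken prompts. All task names, language codes, and scores are invented placeholders that merely mirror the paper's qualitative pattern (larger gaps in low-resource settings, near-zero gaps for speech-output tasks); they are not the paper's data.

    # Illustrative modality-gap analysis: text vs. spoken prompt accuracy
    # per task-language setting. All values below are invented placeholders.
    from collections import defaultdict

    # scores[(task, language, modality)] -> hypothetical accuracy
    scores = {
        ("qa",  "en", "text"): 0.81, ("qa",  "en", "speech"): 0.74,
        ("qa",  "xx", "text"): 0.62, ("qa",  "xx", "speech"): 0.41,  # low-resource
        ("tts", "en", "text"): 0.70, ("tts", "en", "speech"): 0.69,  # speech output
    }

    gaps = defaultdict(float)
    for (task, lang, modality), acc in scores.items():
        gaps[(task, lang)] += acc if modality == "text" else -acc

    # Largest text-over-speech gaps first; speech-output tasks sit near zero.
    for (task, lang), gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
        print(f"{task}/{lang}: text minus speech accuracy = {gap:+.2f}")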

Demerits

Limitation of Task-Specific Results

The reported modality gaps are measured on the nine covered tasks, so the findings may not generalize to tasks and applications outside that set.

Potential Bias in Prompt Variants

Restricting each task-language pair to 10 prompt variants may encode authoring or stylistic biases; further analysis is needed to confirm the dataset's representativeness.

Expert Commentary

The findings carry significant implications for how SLLMs are developed and evaluated. DOWIS addresses a critical gap by assessing models under the spoken instruction conditions they face in deployment rather than through text prompts alone. The study's limitations, including task-specific results and potential bias in the prompt variants, call for further work to confirm the reliability and generalizability of the findings. Even so, the contribution is substantial and is likely to shape how future SLLMs are benchmarked.

Recommendations

  • Future studies should investigate the use of DOWIS in other tasks and applications to further validate its effectiveness.
  • Developers of SLLMs should evaluate and tune their models with speech-based prompting to ensure they remain effective in real-world spoken interaction.

Sources

  • Do What I Say: A Spoken Prompt Dataset for Instruction-Following. arXiv:2603.09881v1. https://arxiv.org/abs/2603.09881