Academic

OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

arXiv:2603.23938v1 Abstract: Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.

Executive Summary

This article presents OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. Experiments on eight models reveal clear limitations in this setting, with the main bottleneck lying not in processing individual modalities but in integrating multimodal context for faithful speech generation. The analysis identifies three common failure modes and offers insights for developing models that can verbalize their responses effectively.
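
To make the task format concrete, below is a minimal Python sketch of what a benchmark instance and a scoring loop might look like. The field names, the speak method, the judge callable, and the example target labels are illustrative assumptions for exposition; the paper's actual data schema and evaluation protocol may differ.

    from dataclasses import dataclass
    from enum import Enum


    class AcousticFeature(Enum):
        """The six acoustic features OmniACBench covers (per the abstract)."""
        SPEECH_RATE = "speech rate"
        PHONATION = "phonation"
        PRONUNCIATION = "pronunciation"
        EMOTION = "emotion"
        GLOBAL_ACCENT = "global accent"
        TIMBRE = "timbre"


    @dataclass
    class Instance:
        """Hypothetical record layout for one of the 3,559 instances."""
        instruction_audio: str    # path to the spoken instruction clip
        script: str               # text the model must read aloud
        image: str                # path to the accompanying image
        feature: AcousticFeature  # acoustic feature under test
        target: str               # e.g. "fast", "whispery", "cheerful"


    def accuracy(model, instances, judge):
        """Toy scoring loop. `model.speak` and `judge` are stand-ins:
        a real harness would score the generated audio with trained
        classifiers or human raters, not a simple boolean check."""
        hits = sum(
            judge(model.speak(i.instruction_audio, i.script, i.image),
                  i.feature, i.target)
            for i in instances
        )
        return hits / len(instances)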

Key Points

  • Introduces OmniACBench, a benchmark of 3,559 verified instances for evaluating context-grounded acoustic control in omni-modal models
  • Evaluates eight models, revealing limitations in the proposed setting despite their strong performance on prior textual-output evaluations
  • Identifies three common failure modes: weak direct control, failed implicit inference, and failed multimodal grounding

Merits

Strengths of the research

The article evaluates the acoustic control capabilities of omni-modal models across six acoustic features, pinpoints multimodal context integration as the main bottleneck for faithful speech generation, and identifies common failure modes.

Methodological rigor

The authors conduct extensive experiments on eight models, using 3,559 verified instances spanning six acoustic features (speech rate, phonation, pronunciation, emotion, global accent, and timbre) to assess performance.

Demerits

Limitations of the research

The article focuses on a single aspect of omni-modal models, acoustic control, and therefore may not provide a comprehensive picture of these models' capabilities.

Scalability of the benchmark

The article does not discuss the scalability of OmniACBench, making it unclear how the benchmark can be applied to larger or more complex datasets.

Expert Commentary

The article makes a significant contribution to the field of multimodal learning by directing attention to acoustic control in omni-modal models, a capability that prior textual-output evaluations overlook. The authors' narrow focus is a strength: it yields a thorough evaluation of an aspect that broader benchmarks miss. That said, the article's limitations, such as the lack of discussion on scalability, should be addressed in future research. The findings could meaningfully shape how speech-capable omni-modal models are developed and evaluated.

Recommendations

  • Recommendation 1: Future research should focus on developing omni-modal models that integrate multimodal context more faithfully when generating speech.
  • Recommendation 2: The authors should extend their research to explore the scalability of OmniACBench and its application to larger or more complex datasets.

Sources

Original: arXiv - cs.CL