
Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

arXiv:2604.00489v1 Announce Type: new Abstract: Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.

Kazuki Yano, Jun Suzuki, Shinji Watanabe


Executive Summary

The article presents a novel method, Multimodal Depth Upscaling, for adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) while mitigating the degradation of original text capabilities. By inserting new transformer layers into a frozen text LLM and training only these layers on speech data, the approach achieves Automatic Speech Recognition (ASR) performance comparable to full fine-tuning with significantly less text degradation. Experiments with SmolLM2 models demonstrate that incorporating the E-Branchformer architecture as the inserted layers further enhances ASR performance, particularly in larger models, while reducing text degradation by over 75% and using 60% fewer trainable parameters. This method offers a promising balance between performance and parameter efficiency in multimodal LLM adaptation.

Key Points

  • Multimodal Depth Upscaling involves inserting new transformer layers into a frozen text LLM and training only these layers on speech data, preserving the original text capabilities better than full fine-tuning or LoRA.
  • Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English ASR data show that this method achieves ASR performance comparable to full fine-tuning with far less text degradation.
  • Incorporating the E-Branchformer architecture as the inserted layers achieves ASR that matches or surpasses full fine-tuning in larger models, while reducing text degradation by over 75% and using 60% fewer trainable parameters.
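The freeze-and-insert pattern described in the key points can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the layer names, insertion interval, and parameter counts below are assumptions chosen only to show how the original stack stays frozen while the interleaved layers carry all trainable parameters.

```python
# Minimal sketch of depth up-scaling: freeze the base text-LLM stack and
# interleave newly initialized, trainable layers (illustrative only; the
# paper's actual layer sizes and insertion points are not reproduced here).
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    n_params: int
    trainable: bool = False  # frozen by default


def depth_upscale(base_layers, new_layer_params, insert_every):
    """Freeze every base layer and insert a fresh trainable layer
    after each `insert_every`-th position."""
    upscaled = []
    for i, layer in enumerate(base_layers, start=1):
        layer.trainable = False  # original text-LLM weights stay frozen
        upscaled.append(layer)
        if i % insert_every == 0:  # new speech-adaptation layer
            upscaled.append(
                Layer(f"speech_{i}", new_layer_params, trainable=True)
            )
    return upscaled


def trainable_fraction(layers):
    total = sum(l.n_params for l in layers)
    return sum(l.n_params for l in layers if l.trainable) / total


# Toy model: 8 frozen text layers of 1M params; insert after every 4th.
base = [Layer(f"text_{i}", 1_000_000) for i in range(8)]
model = depth_upscale(base, new_layer_params=1_000_000, insert_every=4)
print(len(model))                 # 10 layers after up-scaling
print(trainable_fraction(model))  # 0.2: only the inserted layers train
```

The same skeleton would apply whether the inserted layers are plain transformer blocks or E-Branchformer blocks; only the `Layer` internals change.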

Merits

Preservation of Original Capabilities

The method significantly reduces the degradation of text capabilities compared to full fine-tuning and LoRA, addressing a critical challenge in adapting text LLMs to speech modalities.

Parameter Efficiency

By training only the inserted layers, the approach achieves competitive ASR performance with far fewer trainable parameters, enhancing computational efficiency.
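As a hedged back-of-envelope illustration of the abstract's "60% fewer trainable parameters" claim (the exact counts are not stated in the summary, so the 1.7B baseline below is only the nominal SmolLM2-1.7B size):

```python
# Full fine-tuning updates every weight of SmolLM2-1.7B (nominally 1.7e9).
full_ft = 1.7e9
# The abstract reports ~60% fewer trainable parameters with the
# E-Branchformer variant, i.e. roughly 40% of the full count.
upscaled = full_ft * 0.4
print(f"{upscaled / 1e9:.2f}B trainable")  # ~0.68B parameters updated
```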

Performance Competitiveness

The integration of E-Branchformer as the inserted layers not only matches but can surpass full fine-tuning in ASR performance, particularly in larger models, demonstrating the method's scalability and effectiveness.

Scalability and Adaptability

The approach is adaptable to different model sizes and architectures, indicating its potential for broader applications in multimodal LLM development.

Demerits

Limited Generalizability

The experiments are conducted primarily on English ASR data with specific models (SmolLM2). The method's effectiveness across other languages, domains, or model architectures remains untested.

Dependency on Architectural Choices

The performance gains are closely tied to the use of E-Branchformer as the inserted layer architecture. The method's efficacy with alternative architectures is not explored.

Training Data Requirements

The approach relies on extensive speech data (48k hours) for continual pre-training, which may limit its accessibility for researchers or organizations with limited computational resources or data availability.

Potential Overfitting

Training only the inserted layers on speech data may risk overfitting to the specific speech tasks or datasets, potentially limiting generalization to unseen speech inputs.

Expert Commentary

The authors present a compelling and rigorous approach to addressing a longstanding challenge in multimodal AI: adapting text-centric LLMs to speech modalities without sacrificing original capabilities. The method of Multimodal Depth Upscaling, particularly when combined with E-Branchformer, represents a significant advancement in parameter-efficient transfer learning. The experimental results demonstrate not only competitive ASR performance but also a substantial reduction in text degradation, which is critical for maintaining the utility of pre-trained models in real-world applications. The study's focus on scalability, showing benefits in both small and large models, is particularly noteworthy, as it suggests the method's potential applicability across a wide range of model sizes and architectures.

However, the reliance on extensive speech data and the untested generalizability beyond English and specific model architectures warrant caution. Future work could explore the method's performance across diverse languages, domains, and alternative inserted layer architectures to fully realize its potential. Additionally, the ethical implications of such adaptation techniques, particularly concerning bias and fairness in ASR systems, remain an open area for further research. Overall, this work is a valuable contribution to the field, offering practical insights for both academia and industry while highlighting new avenues for exploration.

Recommendations

  • Conduct further experiments to evaluate the generalizability of Multimodal Depth Upscaling across diverse languages, domains, and alternative model architectures to validate its robustness and scalability.
  • Explore hybrid training strategies that combine depth upscaling with other parameter-efficient methods (e.g., LoRA, adapters) to further optimize performance and reduce computational costs.
  • Investigate the ethical and bias implications of adapting text LLMs to speech tasks, particularly in terms of fairness across linguistic and demographic groups, and develop mitigation strategies accordingly.
  • Assess the potential of Multimodal Depth Upscaling in low-resource settings, including scenarios with limited speech data, to broaden its applicability and accessibility.
  • Collaborate with industry stakeholders to integrate the method into real-world applications, such as voice assistants or transcription services, and evaluate its performance in production environments.

Sources

Original: arXiv - cs.CL