Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

arXiv:2604.00007v1 Announce Type: cross Abstract: We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.

Executive Summary

The article introduces Dynin-Omni, a masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike prior unified models that rely on autoregressive serialization or compositional orchestration of external modality-specific decoders, Dynin-Omni formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. The model employs a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment, and reports strong results across 19 multimodal benchmarks, including language reasoning (GSM8K), image generation (GenEval), video understanding (VideoMME), and speech recognition (LibriSpeech). The results highlight the potential of masked diffusion as a unified paradigm for any-to-any modeling, with implications for real-time omnimodal systems, cross-modal retrieval, and embodied agents.

Key Points

  • Dynin-Omni is the first masked-diffusion-based omnimodal foundation model, integrating text, image, and speech understanding and generation, together with video understanding, within a single architecture.
  • The model formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context, unlike autoregressive or compositional approaches.
  • Dynin-Omni performs strongly across 19 multimodal benchmarks, outperforming existing open-source unified models while remaining competitive with modality-specific expert systems.

Merits

Paradigm Innovation

Dynin-Omni introduces a novel masked-diffusion approach to omnimodal modeling, addressing limitations of autoregressive and compositional methods by enabling iterative refinement under bidirectional context in a unified architecture.
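The paper's abstract does not spell out its decoding schedule; the sketch below illustrates the general idea of iterative refinement in masked-diffusion decoding with a toy stand-in for the model. Everything here (`toy_denoiser`, the `MASK` sentinel, and the linear unmasking schedule) is hypothetical, meant only to show how positions are committed over several bidirectional passes rather than left to right:

```python
MASK = -1  # sentinel for a still-masked position

def toy_denoiser(tokens):
    """Hypothetical stand-in for the model: for each masked position,
    return a (token, confidence) prediction. Confidence is a simple
    deterministic function of position so the demo is reproducible."""
    preds = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            preds[i] = (i % 10, 1.0 - i / (10 * len(tokens)))
    return preds

def diffusion_decode(length, steps):
    """Confidence-based unmasking: each step re-predicts every masked
    position with full bidirectional context, then commits only the
    most confident predictions, so all positions finalize in `steps`
    passes instead of `length` left-to-right steps."""
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_denoiser(tokens)
        if not preds:
            break
        # linear schedule: commit an even share of what remains
        k = max(1, len(preds) // (steps - step))
        best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

out = diffusion_decode(8, 4)  # all 8 positions committed in 4 passes
```

Because every pass sees the whole sequence, an early commitment at one end can inform later predictions at the other end, which is the bidirectional refinement the autoregressive factorization cannot do.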

Broad Modal Coverage

The model unifies text, image, speech, and video modalities within a single framework, demonstrating versatility and scalability across diverse tasks, from language reasoning to video understanding.

Performance Superiority

Dynin-Omni reports strong results on 19 multimodal benchmarks (e.g., 87.6 on GSM8K, 0.87 on GenEval, 61.4 on VideoMME, and 2.1 WER on LibriSpeech test-clean), outperforming existing open-source unified models and remaining competitive with modality-specific expert systems, showcasing its robustness and generalizability.

Demerits

Computational Complexity

The masked-diffusion approach may introduce higher computational overhead compared to autoregressive models, particularly during the iterative refinement process, which could limit real-time deployability without optimizations.
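The overhead concern can be made concrete with a back-of-the-envelope attention-cost comparison. The formulas below are illustrative assumptions, not measurements from the paper: they count only attention work in arbitrary units, and ignore constant factors and per-layer details:

```python
def attention_flops_autoregressive(L, d):
    """With a KV cache, generating token t attends over t positions,
    so total attention work is roughly sum over t of t*d."""
    return sum(t * d for t in range(1, L + 1))

def attention_flops_diffusion(L, d, T):
    """Each of T refinement steps re-runs full bidirectional attention
    over all L positions with no cache reuse: roughly T * L * L * d."""
    return T * L * L * d

ar = attention_flops_autoregressive(1024, 1)   # ~L^2/2 units
md = attention_flops_diffusion(1024, 1, 64)    # T*L^2 units
```

Under these assumptions, diffusion decoding is cheaper than autoregressive decoding only when the number of refinement steps T is well below L/2, which is why step-reduction and caching techniques matter for real-time deployment.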

Training and Alignment Challenges

The multi-stage training strategy with modality expansion and omnimodal alignment is complex and resource-intensive, requiring careful orchestration to ensure effective knowledge transfer and modality fusion.

Discrete Tokenization Limitations

The reliance on discrete token spaces for omnimodal modeling may constrain the model's ability to capture fine-grained, continuous features in modalities like speech and video, potentially impacting performance in high-fidelity generation tasks.
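The fidelity trade-off comes from quantization: a discrete tokenizer maps a continuous signal value to the nearest entry of a finite codebook, and the residual is detail the token space cannot represent. The tiny 2-bit codebook below is a hypothetical illustration (real speech/image codecs use much larger, learned codebooks):

```python
def quantize(x, codebook):
    """Map a continuous value to its nearest codebook entry, as a
    discrete tokenizer (e.g. a VQ codec) would. The returned residual
    is information the discrete token space cannot carry."""
    tok = min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
    return tok, x - codebook[tok]

codebook = [-0.75, -0.25, 0.25, 0.75]  # hypothetical 2-bit codebook
tok, err = quantize(0.4, codebook)     # nearest entry is 0.25
```

Enlarging the codebook or stacking residual quantizers shrinks the error but lengthens token sequences, so unified discrete-token models must balance fidelity against sequence length.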

Expert Commentary

Dynin-Omni represents a significant step forward in the unification of multimodal AI, challenging the dominance of autoregressive and compositional approaches with a masked-diffusion paradigm. The model's ability to iteratively refine omnimodal outputs under bidirectional context is particularly noteworthy, as it aligns with emerging trends in bidirectional transformers while addressing their limitations in handling heterogeneous modalities. However, the computational and training complexities of this approach cannot be overstated; the multi-stage training and modality alignment processes demand substantial resources and expertise. From a legal and policy perspective, the model's omnimodal capabilities raise critical questions about liability in cross-modal applications, such as AI-generated content that spans text, image, and audio. Additionally, the discrete tokenization strategy, while innovative, may introduce trade-offs in the fidelity of continuous modalities like speech and video. Overall, Dynin-Omni sets a new benchmark for omnimodal foundation models, but its real-world impact will depend on advances in efficiency, ethical safeguards, and regulatory frameworks.

Recommendations

  • Conduct further research to optimize the computational efficiency of masked-diffusion models, particularly for real-time applications, through techniques like model distillation, quantization, or adaptive inference.
  • Develop standardized benchmarks and evaluation protocols for omnimodal models to ensure fair comparisons and facilitate progress in the field, particularly in addressing ethical and safety concerns.

Sources

Original: arXiv - cs.AI