Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
arXiv:2603.11578v1. Abstract: Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
Executive Summary
The article introduces Hikari, a policy-free, end-to-end model for simultaneous speech-to-text translation and streaming transcription that replaces traditional heuristic- or policy-driven approaches with a probabilistic WAIT token mechanism and Decoder Time Dilation. Hikari achieves state-of-the-art BLEU scores across multiple language pairs in both low- and high-latency regimes, demonstrating a strong quality-latency trade-off. Together, the WAIT token and Decoder Time Dilation offer a scalable, efficient, and adaptive framework for real-time translation systems, while the supervised fine-tuning strategy trains the model to recover from delays, adding robustness in deployment scenarios.
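The abstract does not spell out the decoding loop, but the behavior it describes, READ/WRITE decisions folded into a single WAIT token emitted by the decoder, can be sketched roughly as below. Everything here is illustrative: `mock_model`, the token names, and the control flow are our assumptions, not Hikari's actual implementation.

```python
WAIT = "<wait>"  # hypothetical special token signaling a READ action

def mock_model(source_prefix, target_prefix):
    # Stand-in for a streaming decoder: emits WAIT until it has read
    # more source chunks than it has written target tokens.
    if len(source_prefix) <= len(target_prefix):
        return WAIT
    return f"tok{len(target_prefix)}"

def streaming_decode(source_stream, model, max_len=20):
    """Policy-free loop: the model itself decides READ vs. WRITE."""
    source, target = [], []
    stream = iter(source_stream)
    while len(target) < max_len:
        token = model(source, target)
        if token == WAIT:                  # READ: consume one more source chunk
            chunk = next(stream, None)
            if chunk is None:              # source exhausted; stop decoding
                break
            source.append(chunk)
        else:                              # WRITE: commit a target token
            target.append(token)
    return target
```

Because the READ/WRITE choice is just another token in the output distribution, no external policy or heuristic schedule is needed; this is the property the paper's "policy-free" claim refers to.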
Key Points
- ▸ Probabilistic WAIT token mechanism encoding READ/WRITE decisions for causal alignment
- ▸ Decoder Time Dilation, which reduces autoregressive overhead and balances the training distribution
- ▸ Supervised fine-tuning strategy that trains the model to recover from delays
- ▸ New state-of-the-art BLEU scores on English-to-Japanese, German, and Russian in both low- and high-latency regimes
Merits
Innovation
Hikari eliminates reliance on heuristics or policies by integrating causal alignment via probabilistic WAIT tokens, presenting a more scalable and transparent architecture.
Performance
Achieves new SOTA BLEU scores in both low- and high-latency regimes across multiple languages, validating efficacy.
Adaptability
Decoder Time Dilation reduces autoregressive overhead and improves training distribution, enhancing generalization.
Demerits
Complexity
The WAIT token mechanism may pose interpretability challenges for researchers unfamiliar with causal modeling, and it may require careful, nuanced tuning in practice.
Generalizability
Evaluation is limited to English-to-Japanese, German, and Russian; broader multilingual applicability remains unproven.
Expert Commentary
Hikari represents a significant conceptual leap in simultaneous translation by fundamentally reimagining the control flow of autoregressive models through causal alignment. The WAIT token mechanism, while conceptually elegant, demands careful calibration to avoid over-optimization artifacts, particularly in low-resource scenarios. Decoder Time Dilation’s impact on training balance is particularly noteworthy—it mitigates the classic problem of sequence-length bias without compromising latency performance. Importantly, the supervised fine-tuning component introduces a pragmatic bridge between research and deployment, addressing a critical gap in real-world system robustness. While the evaluation scope is currently constrained, the framework’s modularity suggests strong potential for extension into low-resource or domain-specific settings. This work exemplifies how theoretical elegance can translate into tangible operational gains, and it sets a new benchmark for evaluating such systems: beyond BLEU, latency-aware evaluation metrics must evolve to capture the full impact of innovations like Hikari.
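The call for latency-aware evaluation can be made concrete with Average Lagging (AL), a standard SiMT latency metric that measures how far the system trails an ideal fully-synchronous translator. The helper below is a minimal sketch; the function name and argument conventions are ours, not taken from the paper.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019) from a delay function.

    delays[t-1] = number of source tokens/chunks read before emitting
    target token t. Lower AL means lower latency; an offline system
    that reads everything first gets AL close to src_len.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: first step at which the full source has been read
    tau = next((t for t, g in enumerate(delays, 1) if g >= src_len),
               len(delays))
    return sum(delays[t - 1] - (t - 1) / gamma
               for t in range(1, tau + 1)) / tau
```

For example, a wait-1-style schedule on a length-3 pair (`delays=[1, 2, 3]`) yields AL = 1.0, while a fully offline schedule (`delays=[3, 3, 3]`) yields AL = 3.0, which is the kind of quality-latency axis on which Hikari's trade-off is reported.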
Recommendations
- ✓ Researchers should extend Hikari’s architecture to include dynamic batching and adaptive sampling for heterogeneous networks.
- ✓ Industry stakeholders should evaluate Hikari’s model under edge-device constraints and latency-sensitive use cases (e.g., live captioning, emergency communications) to validate real-world impact.