Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
arXiv:2603.11578v1. Abstract: Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
Executive Summary
The article introduces Hikari, a policy-free, end-to-end model for simultaneous speech-to-text translation and streaming transcription that replaces traditional heuristic- or policy-driven approaches with a probabilistic WAIT token mechanism and Decoder Time Dilation. Hikari achieves state-of-the-art BLEU scores across multiple language pairs in both low- and high-latency regimes, demonstrating a strong quality-latency trade-off. Together, the WAIT token and Decoder Time Dilation offer a scalable, efficient, and adaptive framework for real-time translation systems, while the supervised fine-tuning strategy trains the model to recover from delays, adding robustness in deployment scenarios.
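The abstract does not spell out the decoding loop, but the behavior it describes, READ/WRITE decisions folded into a single WAIT token emitted by the decoder, can be sketched roughly as below. Everything here is illustrative: `mock_model`, the token names, and the control flow are our assumptions, not Hikari's actual implementation.

```python
WAIT = "<wait>"  # hypothetical special token signaling a READ action

def mock_model(source_prefix, target_prefix):
    # Stand-in for a streaming decoder: emits WAIT until it has read
    # more source chunks than it has written target tokens.
    if len(source_prefix) <= len(target_prefix):
        return WAIT
    return f"tok{len(target_prefix)}"

def streaming_decode(source_stream, model, max_len=20):
    """Policy-free loop: the model itself decides READ vs. WRITE."""
    source, target = [], []
    stream = iter(source_stream)
    while len(target) < max_len:
        token = model(source, target)
        if token == WAIT:                  # READ: consume one more source chunk
            chunk = next(stream, None)
            if chunk is None:              # source exhausted; stop decoding
                break
            source.append(chunk)
        else:                              # WRITE: commit a target token
            target.append(token)
    return target
```

Because the READ/WRITE choice is just another token in the output distribution, no external policy or heuristic schedule is needed; this is the property the paper's "policy-free" claim refers to.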
Key Points
- ▸ Probabilistic WAIT token mechanism encoding READ/WRITE decisions for causal alignment
- ▸ Decoder Time Dilation, which reduces autoregressive overhead and balances the training distribution
- ▸ Supervised fine-tuning strategy that trains the model to recover from delays
- ▸ New state-of-the-art BLEU scores on English-to-Japanese, German, and Russian in both low- and high-latency regimes
Merits
Innovation
Hikari eliminates reliance on heuristics or policies by integrating causal alignment via probabilistic WAIT tokens, presenting a more scalable and transparent architecture.
Performance
Achieves new SOTA BLEU scores in both low- and high-latency regimes across multiple languages, validating efficacy.
Adaptability
Decoder Time Dilation reduces autoregressive overhead and improves training distribution, enhancing generalization.
Demerits
Complexity
The WAIT token mechanism may pose interpretability challenges for researchers unfamiliar with causal modeling, and it may require careful, nuanced tuning in practice.
Generalizability
Evaluation is limited to English-to-Japanese, German, and Russian; broader multilingual applicability remains unproven.
Expert Commentary
Hikari represents a significant conceptual leap in simultaneous translation by fundamentally reimagining the control flow of autoregressive models through causal alignment. The WAIT token mechanism, while conceptually elegant, demands careful calibration to avoid over-optimization artifacts, particularly in low-resource scenarios. Decoder Time Dilation’s impact on training balance is particularly noteworthy—it mitigates the classic problem of sequence-length bias without compromising latency performance. Importantly, the supervised fine-tuning component introduces a pragmatic bridge between research and deployment, addressing a critical gap in real-world system robustness. While the evaluation scope is currently constrained, the framework’s modularity suggests strong potential for extension into low-resource or domain-specific settings. This work exemplifies how theoretical elegance can translate into tangible operational gains, and it sets a new benchmark for evaluating such systems: beyond BLEU, latency-aware evaluation metrics must evolve to capture the full impact of innovations like Hikari.
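The call for latency-aware evaluation can be made concrete with Average Lagging (AL), a standard SiMT latency metric that measures how far the system trails an ideal fully-synchronous translator. The helper below is a minimal sketch; the function name and argument conventions are ours, not taken from the paper.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019) from a delay function.

    delays[t-1] = number of source tokens/chunks read before emitting
    target token t. Lower AL means lower latency; an offline system
    that reads everything first gets AL close to src_len.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: first step at which the full source has been read
    tau = next((t for t, g in enumerate(delays, 1) if g >= src_len),
               len(delays))
    return sum(delays[t - 1] - (t - 1) / gamma
               for t in range(1, tau + 1)) / tau
```

For example, a wait-1-style schedule on a length-3 pair (`delays=[1, 2, 3]`) yields AL = 1.0, while a fully offline schedule (`delays=[3, 3, 3]`) yields AL = 3.0, which is the kind of quality-latency axis on which Hikari's trade-off is reported.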
Recommendations
- ✓ Researchers should extend Hikari’s architecture to include dynamic batching and adaptive sampling for heterogeneous networks.
- ✓ Industry stakeholders should evaluate Hikari’s model under edge-device constraints and latency-sensitive use cases (e.g., live captioning, emergency communications) to validate real-world impact.