A Hierarchical End-of-Turn Model with Primary Speaker Segmentation for Real-Time Conversational AI
arXiv:2603.13379v1 Announce Type: new Abstract: We present a real-time front-end for voice-based conversational AI to enable natural turn-taking in two-speaker scenarios by combining primary speaker segmentation with hierarchical End-of-Turn (EOT) detection. To operate robustly in multi-speaker environments, the system continuously identifies and tracks the primary user, ensuring that downstream EOT decisions are not confounded by background conversations. The tracked activity segments are fed to a hierarchical, causal EOT model that predicts the immediate conversational state by independently analyzing per-speaker speech features from both the primary speaker and the bot. Simultaneously, the model anticipates near-future states ($t{+}10/20/30$\,ms) through probabilistic predictions that are aware of the conversation partner's speech. Task-specific knowledge distillation compresses wav2vec~2.0 representations (768\,D) into a compact MFCC-based student (32\,D) for efficient deployment. The system achieves 82\% multi-class frame-level F1 and 70.6\% F1 on Backchannel detection, with 69.3\% F1 on a binary Final vs.\ Others task. On an end-to-end turn-detection benchmark, our model reaches 87.7\% recall vs.\ 58.9\% for Smart Turn~v3 while keeping a median detection latency of 36\,ms versus 800--1300\,ms. Despite using only 1.14\,M parameters, the proposed model matches or exceeds transformer-based baselines while substantially reducing latency and memory footprint, making it suitable for edge deployment.
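The distillation step the abstract describes, compressing 768-D wav2vec 2.0 representations into a 32-D MFCC-based student, can be illustrated with a minimal NumPy sketch. This is a generic feature-regression objective under assumed dimensions; the paper's actual task-specific distillation loss and training recipe are not given in the abstract, and the names here (`distill_loss`, `proj`) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

TEACHER_DIM, STUDENT_DIM = 768, 32  # wav2vec 2.0 teacher vs. MFCC-based student

def distill_loss(student_feats, teacher_feats, proj):
    """MSE between projected student features and teacher embeddings
    (a stand-in for the paper's task-specific distillation objective)."""
    pred = student_feats @ proj                 # (T, 32) -> (T, 768)
    return float(np.mean((pred - teacher_feats) ** 2))

T = 10
student = rng.normal(size=(T, STUDENT_DIM))    # stand-in MFCC frames
teacher = rng.normal(size=(T, TEACHER_DIM))    # stand-in wav2vec 2.0 frames
proj = np.zeros((STUDENT_DIM, TEACHER_DIM))    # learned projection

# One analytic gradient step on the projection; MSE in `proj` is convex,
# so a small step is guaranteed to reduce the loss.
grad = (2.0 / (T * TEACHER_DIM)) * student.T @ (student @ proj - teacher)
before = distill_loss(student, teacher, proj)
after = distill_loss(student, teacher, proj - 0.5 * grad)
```

In a real setup the student would be trained end-to-end on the EOT task with the distillation term as an auxiliary loss, rather than via a single closed-form projection as sketched here.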
Executive Summary
This article presents a real-time front-end for conversational AI that integrates primary speaker segmentation with hierarchical End-of-Turn (EOT) detection for natural turn-taking in two-speaker scenarios. The system continuously tracks the primary user so that background conversations do not confound EOT decisions, and a hierarchical, causal EOT model predicts both the immediate conversational state and near-future states ($t{+}10/20/30$\,ms). Despite using only 1.14\,M parameters, the model reaches 82\% multi-class frame-level F1 and 87.7\% recall on an end-to-end turn-detection benchmark (vs.\ 58.9\% for Smart Turn v3), at a median detection latency of 36\,ms compared with 800--1300\,ms for that baseline. These reductions in latency and memory footprint, achieved while matching or exceeding transformer-based baselines, make the system well suited to edge deployment and to multi-speaker environments.
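The core modeling idea, causal (past-only) fusion of per-speaker features with one probabilistic head per prediction horizon ($t{+}0/10/20/30$\,ms), can be sketched as a toy NumPy recurrence. The label set, recurrence, and names (`CausalEOTHead`, `STATES`) are hypothetical stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

STATES = ["continue", "backchannel", "final"]  # hypothetical label set
HORIZONS_MS = [0, 10, 20, 30]                  # current state + near-future predictions

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class CausalEOTHead:
    """Toy causal EOT model: per-speaker features are fused frame by frame
    and mapped to one state distribution per prediction horizon."""
    def __init__(self, feat_dim=32, hidden=16):
        self.w_in = rng.normal(0, 0.1, (2 * feat_dim, hidden))
        self.w_rec = rng.normal(0, 0.1, (hidden, hidden))
        self.heads = [rng.normal(0, 0.1, (hidden, len(STATES)))
                      for _ in HORIZONS_MS]

    def run(self, user_feats, bot_feats):
        """user_feats, bot_feats: (T, feat_dim) per-speaker streams."""
        h = np.zeros(self.w_rec.shape[0])
        out = []
        for u, b in zip(user_feats, bot_feats):
            x = np.concatenate([u, b])                   # partner-aware fusion
            h = np.tanh(x @ self.w_in + h @ self.w_rec)  # causal: past frames only
            out.append([softmax(h @ w) for w in self.heads])
        return np.array(out)  # (T, num_horizons, num_states)

T, D = 5, 32
model = CausalEOTHead(feat_dim=D)
probs = model.run(rng.normal(size=(T, D)), rng.normal(size=(T, D)))
```

The key structural point the sketch captures is that each frame's prediction uses only past input (no look-ahead), while separate heads give calibrated, partner-aware probabilities for the current and near-future conversational states.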
Key Points
- ▸ The proposed model integrates primary speaker segmentation with hierarchical EOT detection for real-time conversational AI.
- ▸ The system continuously tracks the primary user to ensure robust operation in multi-speaker environments.
- ▸ The hierarchical, causal EOT model matches or exceeds transformer-based baselines, reaching 82\% multi-class frame-level F1 and 87.7\% recall on an end-to-end turn-detection benchmark (vs.\ 58.9\% for Smart Turn v3).
- ▸ With only 1.14\,M parameters and a median detection latency of 36\,ms (vs.\ 800--1300\,ms for Smart Turn v3), the model substantially reduces latency and memory footprint relative to transformer-based baselines.
Merits
Strength in Robustness
The proposed system's ability to continuously track the primary user and operate robustly in multi-speaker environments is a significant strength.
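As a rough illustration of what continuous primary-speaker tracking involves, the sketch below gates per-frame speaker embeddings by cosine similarity to a slowly adapting user profile. This is a hypothetical stand-in (`track_primary`); the paper's segmentation method is not detailed in the abstract.

```python
import numpy as np

rng = np.random.default_rng(2)

def track_primary(frame_embs, enroll_emb, thresh=0.5):
    """Toy primary-speaker gate: keep frames whose speaker embedding is
    cosine-similar to the tracked user's running profile."""
    profile = enroll_emb / np.linalg.norm(enroll_emb)
    keep = []
    for i, e in enumerate(frame_embs):
        e = e / np.linalg.norm(e)
        if float(e @ profile) >= thresh:
            keep.append(i)
            # slowly adapt the profile toward accepted frames
            profile = 0.9 * profile + 0.1 * e
            profile /= np.linalg.norm(profile)
    return keep

D = 64
primary = rng.normal(size=D)
# three frames from the primary speaker, then three background frames
frames = np.stack([primary + 0.1 * rng.normal(size=D) for _ in range(3)]
                  + [rng.normal(size=D) for _ in range(3)])
kept = track_primary(frames, primary)
```

Only frames passing the gate would be forwarded to the EOT model, which is how a front-end of this kind keeps background conversations from confounding downstream turn decisions.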
Advancements in Efficiency
The model's reduction in latency (median 36\,ms vs.\ 800--1300\,ms for Smart Turn v3) and memory footprint (1.14\,M parameters) makes it suitable for edge deployment.
Demerits
Limited Evaluation
The article primarily focuses on two-speaker scenarios and does not extensively evaluate the model's performance in more complex multi-speaker settings.
Expert Commentary
The proposed model represents a meaningful advance in conversational AI, particularly in its robustness to interfering speakers and its efficiency. However, further evaluation in more complex multi-speaker scenarios is needed to fully assess its capabilities. The focus on edge deployment is timely, given the growing need for AI systems that operate efficiently in resource-constrained environments. Overall, the model is a valuable contribution to real-time conversational AI, with direct relevance to latency-sensitive voice applications.
Recommendations
- ✓ Future research should focus on evaluating the proposed model's performance in more complex multi-speaker scenarios.
- ✓ Developers should consider the proposed model's efficiency and robustness when designing conversational AI systems for edge deployment.