How to handle interruptions better while building speech to speech pipeline? #156
That's a pretty general question. Interruption handling is tricky because it often requires echo cancellation of the voice agent's TTS output. Latency, on the other hand, is all about balancing. A fast STT system should transcribe in under 100 ms on a strong GPU, and a decent TTS system adds around 200 ms. The rest of the delay comes from LLM generation or speech end detection. Most basic speech endpoint detection methods rely on waiting for a certain amount of silence, which naturally adds latency.

For better latency:
- Make sure your STT, LLM, and TTS are as fast as possible.
- Use a more advanced speech endpoint detection method, like adjusting the silence threshold based on real-time transcription (e.g., detecting end punctuation) or analyzing frequency changes. People often lower their pitch when finishing a thought or raise it for questions.

For interruption handling:
- Remove TTS feedback from the input and apply volume-based thresholds afterward.
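The punctuation-based idea above can be sketched in a few lines: require only a short stretch of trailing silence when the live transcript already ends in sentence-final punctuation, and a longer one otherwise. The function name and the concrete thresholds are illustrative assumptions, not part of any library.

```python
# Sentence-final punctuation that suggests the speaker has finished a thought.
END_PUNCTUATION = (".", "!", "?")

def silence_threshold(partial_transcript: str,
                      short_s: float = 0.3,
                      long_s: float = 1.2) -> float:
    """Return how many seconds of trailing silence to require before
    declaring end-of-speech. If the real-time transcript already ends
    with end punctuation, commit quickly; otherwise wait longer to
    avoid cutting the speaker off mid-sentence. Thresholds are
    placeholder values to tune for your pipeline."""
    text = partial_transcript.rstrip()
    if text.endswith(END_PUNCTUATION):
        return short_s
    return long_s
```

You would call this on every partial transcription update and feed the result into whatever silence-duration check your endpoint detector uses.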
First of all, this is an insanely well done and robust library. Second, any more clues on where to look to "remove TTS feedback from the input"?
Easy way: mute the mic when TTS is running. Like this:

When TTS starts:

```python
recorder.abort()
recorder.stop()
```

When TTS stops:

```python
recorder.clear_audio_queue()
recorder.recording_stop_time = 0
recorder.wakeup()
```

Harder way: add echo cancellation. Haven't nailed this reliably yet, but it would be awesome to see some working code. This would allow you to interrupt the voice agent mid-response.
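The "easy way" above can be wrapped in a small gate object that your TTS engine's start/stop callbacks drive. The recorder method names are the ones from this thread (RealtimeSTT's recorder); the `TTSGate` class and the idea of wiring it to TTS callbacks are an assumption about your pipeline, not part of the library.

```python
class TTSGate:
    """Mute the mic while the agent speaks, then re-arm it.
    Hypothetical glue code: `recorder` is assumed to expose the
    methods quoted in this thread (abort, stop, clear_audio_queue,
    recording_stop_time, wakeup)."""

    def __init__(self, recorder):
        self.recorder = recorder
        self.tts_active = False

    def on_tts_start(self):
        # Abandon any in-progress recording so the agent's own
        # voice never reaches the STT model.
        self.tts_active = True
        self.recorder.abort()
        self.recorder.stop()

    def on_tts_stop(self):
        # Flush audio captured while the agent was speaking,
        # then start listening again.
        self.recorder.clear_audio_queue()
        self.recorder.recording_stop_time = 0
        self.recorder.wakeup()
        self.tts_active = False
```

Register `on_tts_start` / `on_tts_stop` with whatever playback-started and playback-finished hooks your TTS engine provides. The trade-off of this approach is that the user cannot barge in while the agent is speaking; for that you need the echo-cancellation route.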
Any examples of an end-to-end speech-to-speech pipeline with better latency and interruption handling?