How to handle interruptions better while building a speech-to-speech pipeline? #156

Open
mehul-fabrichq opened this issue Dec 2, 2024 · 3 comments

Comments

@mehul-fabrichq

Are there any examples of an end-to-end speech-to-speech pipeline with better latency and interruption handling?

@KoljaB
Owner

KoljaB commented Dec 4, 2024

That's a pretty general question. Interruption handling is tricky because it often requires echo cancellation of the voice agent's TTS output. Latency, on the other hand, is all about balancing. A fast STT system should transcribe in under 100ms on a strong GPU, and a decent TTS system adds around 200ms. The rest of the delay comes from LLM generation or speech end detection.
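To make the balancing concrete, here is a rough budget for one turn. Only the STT and TTS figures come from the numbers above; the silence wait and the LLM time are placeholder assumptions, not measurements, so swap in your own.

```python
# Rough latency budget for one voice-agent turn, in milliseconds.
# "stt" and "tts" use the figures quoted above; the other two entries
# are illustrative placeholders (assumptions), not measured values.
budget_ms = {
    "endpoint_silence_wait": 400,  # assumption: silence waited before ending the turn
    "stt": 100,                    # upper bound quoted above for a strong GPU
    "llm_first_sentence": 300,     # assumption: time until the LLM has a sentence ready
    "tts": 200,                    # rough figure quoted above
}


def total_latency_ms(budget):
    """Sum all stages to get the delay the user actually hears."""
    return sum(budget.values())


def largest_contributor(budget):
    """Name of the stage that costs the most time."""
    return max(budget, key=budget.get)
```

With these placeholder numbers the silence wait dominates, which is why endpoint detection is usually worth tuning first.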

Most basic speech endpoint detection methods rely on waiting for a certain amount of silence, which naturally adds latency.

For better latency:

Make sure your STT, LLM, and TTS are as fast as possible.

Use a more advanced speech endpoint detection method, like adjusting the silence threshold based on real-time transcription (e.g., detecting end punctuation) or analyzing frequency changes. People often lower their pitch when finishing a thought or raise it for questions.
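A minimal sketch of the punctuation idea, assuming you get partial transcripts while the user is still speaking. The function name and the timeout values are made up for illustration; tune them for your setup.

```python
def silence_timeout(partial_text, short=0.4, default=1.0):
    """Pick how many seconds of silence to wait before ending the turn.

    If the live transcript already ends with sentence-final punctuation,
    the speaker is probably done, so a shorter wait is enough.
    The default values are illustrative assumptions.
    """
    text = partial_text.rstrip()
    if text and text[-1] in ".!?":
        return short
    return default
```

You would call this from the real-time transcription callback and assign the result to whatever silence-duration setting your recorder exposes (in RealtimeSTT I believe this is `post_speech_silence_duration`, but check the docs for your version).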

For interruption handling:

Remove TTS feedback from the input and apply volume-based thresholds afterward.
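A rough sketch of the volume threshold step, assuming the TTS feedback has already been removed and the input arrives as frames of 16-bit PCM samples. The threshold and frame count are placeholder assumptions and need calibrating per microphone.

```python
import math


def rms(samples):
    """Root-mean-square level of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_interruption(frames, threshold=1500.0, min_frames=3):
    """Return True once `min_frames` consecutive frames reach the threshold.

    Requiring several loud frames in a row avoids triggering on clicks
    or other short noises. Both defaults are illustrative assumptions.
    """
    run = 0
    for frame in frames:
        if rms(frame) >= threshold:
            run += 1
            if run >= min_frames:
                return True
        else:
            run = 0  # a quiet frame resets the streak
    return False
```

When this returns True you would stop TTS playback and hand the audio to STT.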

@adhambadr

First of all, this is an insanely well done and robust library. Second, any more clues on where to look to 'Remove TTS feedback from the input'?
Right now I'm having the issue that the generated TTS is being picked up by the mic and fed back into the STT pipeline.

@KoljaB
Owner

KoljaB commented Dec 24, 2024

Easy way: mute the mic when TTS is running. Like this:

When TTS starts:

recorder.abort()
recorder.stop()

When TTS stops:

recorder.clear_audio_queue()
recorder.recording_stop_time = 0
recorder.wakeup()
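One way to keep the two halves paired is a small context manager. This is just a sketch wrapping the calls above; `mic_muted` is a made-up helper name, not part of the library, and `recorder` is assumed to be your existing recorder object.

```python
from contextlib import contextmanager


@contextmanager
def mic_muted(recorder):
    """Mute the recorder for the duration of the `with` block.

    Uses only the recorder calls listed above. The `finally` clause
    makes sure the mic is re-enabled even if TTS playback raises.
    """
    # When TTS starts
    recorder.abort()
    recorder.stop()
    try:
        yield
    finally:
        # When TTS stops
        recorder.clear_audio_queue()
        recorder.recording_stop_time = 0
        recorder.wakeup()
```

Usage would look like `with mic_muted(recorder): play_tts(text)`, where `play_tts` stands in for whatever blocking call plays your TTS audio.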

Harder way: add echo cancellation. Haven’t nailed this reliably yet, but would be awesome to see some working code. This would allow you to interrupt the voice agent mid-response.
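For anyone who wants a starting point: the textbook approach is an adaptive filter such as NLMS (normalized least mean squares), which learns how the TTS output shows up in the mic signal and subtracts it. The sketch below is that textbook algorithm in plain Python, not something tested in this library. It assumes the TTS reference and the mic capture are sample-aligned at the same rate, and it ignores the parts that make this hard in real rooms (variable playback delay, double-talk, nonlinear speakers). The function name and defaults are illustrative.

```python
def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `reference` from `mic`.

    reference: samples sent to the speaker (the TTS output)
    mic:       samples captured by the microphone, same length and rate
    Returns the residual signal, i.e. the mic with the estimated echo removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate  # what is left after removing the echo
        # Normalized step: scale the update by the reference energy.
        norm = sum(h * h for h in history) + eps
        step = mu * e / norm
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual
```

In practice a dedicated acoustic echo cancellation implementation (for example the one built into many OS audio stacks or WebRTC) is likely a better bet than rolling your own.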
