
New Fork: Web client + WebSocket + own VAD impl. #105

Open
marcinmatys opened this issue Jul 8, 2024 · 7 comments

Comments

@marcinmatys
I have created a fork of whisper_streaming, so I took the liberty of writing about it here.
We may close this issue soon, as it is for information only.

I encourage you to check it out if you are interested in topics such as
a web browser-based client with WebSocket communication,
voice activity detection, and silence processing.

If you have any comments, please write here or check out the feedback section in my README.

@vuduc153

vuduc153 commented Jul 8, 2024

@marcinmatys Hi, thanks for the fork, it's really a godsend since I was looking to put together something similar. :)
One thing I noticed is that the VAD seems to reset the timestamp to 0 every time it starts again after a silence period. Is this the expected behavior?

@marcinmatys
Author

@vuduc153 Thanks for your feedback.

When silence is detected, the OnlineASRProcessor finish() and init() methods are called to read the uncommitted transcription and clear the buffer. We lose context then, but in my opinion it does not have a significant impact on quality. However, I must say that this implementation is just my experiment. You have to run the tests yourself and decide whether it is appropriate or not.

You could remove the online.init() line from the code below and check the difference.

```python
if not silence_started:
    o = online.finish()  # flush the uncommitted transcription
    online.init()        # reset the processor; context is lost from here on
```

@vuduc153

vuduc153 commented Jul 8, 2024

@marcinmatys Thanks for the reply, I just wanted to confirm that this is indeed the intended logic.
There's also an issue with really long pauses (>10 s) in the current code. Since rms is calculated as the root mean square of the entire ongoing silence_candidate_chunk, when speech starts again after a long pause, rms will still be under SILENCE_THRESHOLD for a while, until the new data brings the mean back above the threshold. From my experience, it takes around 1/10 of the pause duration for the ASR to pick up again, which means the first sentence after a pause loses some words at the beginning.

Calculating rms per received audio chunk might be a better way to approach this. I have slightly modified the logic in this section in a PR. Let me know what you think.
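To illustrate the difference, here is a minimal sketch; the rms helper, the threshold value, and the simulated signal are assumptions for illustration, not the fork's actual code:

```python
import numpy as np

SILENCE_THRESHOLD = 0.05  # assumed value; the real threshold in the fork may differ
SAMPLE_RATE = 16_000

def rms(samples: np.ndarray) -> float:
    """Root mean square of a float32 audio buffer."""
    return float(np.sqrt(np.mean(np.square(samples))))

# Simulate a 10 s pause followed by 0.5 s of speech (a 440 Hz tone stands in for voice).
t = np.arange(SAMPLE_RATE // 2) / SAMPLE_RATE
pause = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)
speech = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# Accumulated approach (current code): the long silent prefix drags the mean down,
# so the whole buffer still classifies as "silent" even though speech has resumed.
buffer = np.concatenate([pause, speech])
print(rms(buffer) < SILENCE_THRESHOLD)   # True  -> speech onset is missed

# Per-chunk approach (the PR): only the newest audio is measured,
# so speech is detected as soon as it arrives.
print(rms(speech) < SILENCE_THRESHOLD)   # False -> speech detected immediately
```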

@Gldkslfmsd
Collaborator

Thanks for the nice work, @marcinmatys. I briefly looked at your README2 and found out that you're using numpy sound intensity detection as "VAD". I think that way you can detect silence vs. non-silence. What about noise vs. speech?

In the vad_streaming branch I'm using Silero VAD, a neural torch model, to detect non-voice (such as noise, silence, music, etc.) vs. voice. It should be more robust than your numpy approach. Silero is used as the VAD in the default offline Whisper, and it was recommended to me in #39.
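For reference, a minimal sketch of typical Silero VAD usage via torch.hub, roughly following the snakers4/silero-vad README (this is not the exact code in the vad_streaming branch, and 'example.wav' is a placeholder):

```python
import torch

# Load the pretrained Silero VAD model plus helper utilities from torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Read 16 kHz mono audio and get the sample ranges that contain voice.
wav = read_audio('example.wav', sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # e.g. [{'start': 1504, 'end': 38048}, ...]
```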

@marcinmatys
Author

@vuduc153 Thanks for this information and PR. You are right; there is probably an issue with long pauses. However, there is also a problem with your new logic. We need to improve your fix. I will write the details in the PR comment.

@marcinmatys
Author

@Gldkslfmsd Thank you for your response and explanations.
I need to look at and test the vad_streaming branch one more time and check your silence removal logic.
Do you have any plans to eventually verify vad_streaming and merge it into the main branch?

Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment we are in: whether there is noise around us, what kind of noise it is, and what microphone we are using.

We have two types of microphones:

- Headset microphone: a microphone in a headset, positioned near the mouth.
- Omnidirectional microphone: a microphone used in conference settings that captures sound from all directions.

I performed some tests using a headset microphone and played some conversation (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise even when the other speaker was really close.

Do you think that numpy sound intensity detection could work more efficiently than Silero? Maybe there should be an option to use one of them: if we need a more robust tool, we use Silero, but if not, we use simple numpy.

@Gldkslfmsd
Collaborator

> @Gldkslfmsd Thank you for your response and explanations. I need to look at and test the vad_streaming branch one more time and check your silence removal logic.
>
> Do you have any plans to eventually verify vad_streaming and merge it into the main branch?

It's verified, and it works very well, but the code is ugly. It needs to be cleaned up, made transparent and self-documented. Then it can be merged.

Not in my time schedule now.

> Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment we are in: whether there is noise around us, what kind of noise it is, and what microphone we are using.
>
> We have two types of microphones:
>
> - Headset microphone: a microphone in a headset, positioned near the mouth.
> - Omnidirectional microphone: a microphone used in conference settings that captures sound from all directions.
>
> I performed some tests using a headset microphone and played some conversation (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise even when the other speaker was really close.
>
> Do you think that numpy sound intensity detection could work more efficiently than Silero? Maybe there should be an option to use one of them: if we need a more robust tool, we use Silero, but if not, we use simple numpy.

I believe there are some good reasons why Silero exists. Check their paper and other VAD papers. They may have tested it rigorously; you could reproduce some of their tests.

Numpy may be faster, simpler to install, and good enough for many. If you present evidence, we can integrate it as an option.
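A hypothetical sketch of what such an option could look like; the --vad-backend flag, the factory function, and the threshold value are illustrative and not part of whisper_streaming:

```python
import argparse
import numpy as np

def make_vad(backend: str):
    """Return a callable chunk -> bool (True if the float32 chunk contains voice).
    Hypothetical factory; neither backend is wired into whisper_streaming yet."""
    if backend == "numpy":
        threshold = 0.05  # simple sound-intensity gate; value is illustrative
        return lambda chunk: float(np.sqrt(np.mean(chunk ** 2))) > threshold
    if backend == "silero":
        import torch
        model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')
        # The model returns a speech probability for a short 16 kHz chunk.
        return lambda chunk: model(torch.from_numpy(chunk), 16000).item() > 0.5
    raise ValueError(f"unknown VAD backend: {backend}")

parser = argparse.ArgumentParser()
parser.add_argument("--vad-backend", choices=["numpy", "silero"], default="numpy")
args = parser.parse_args()
is_voice = make_vad(args.vad_backend)
```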
