-
❓ Questions and Help

Hey there, for some reason I am not able to detect short words like "Hi" or "Hello", though longer phrases are detected fine. I am passing in chunks of 240ms at a time; every 120ms I throw away the first 120ms of data, so that we "crawl" through the audio in 120ms steps. Here are the params I'm using:
Looking at the raw probabilities, each time I say "hello" I see a run of at least 5 consecutive windows with probabilities ≥ 0.9. Since each window is 32ms (256 samples / 8000 Hz), shouldn't that mean there is at least 160ms of speech above the specified threshold? Yet the VAD is returning nothing. Things work consistently well when we pass in larger blocks, e.g. 400ms, but shouldn't it also work with a 240ms block, given the low value of that setting? I've attached the raw probabilities for the audio, as well as a screenshot of the visualizations. Thanks so much in advance for your help!
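A hypothetical reconstruction of the feeding scheme described above (all names invented; note that each VAD call only ever sees an isolated 240ms snapshot, which the replies below address):

```python
import torch

SAMPLING_RATE = 8000
STEP = int(0.120 * SAMPLING_RATE)  # 960 samples  = 120ms hop
WINDOW = 2 * STEP                  # 1920 samples = 240ms buffer

buffer = torch.empty(0)

def on_new_audio(samples: torch.Tensor, run_vad) -> None:
    """Append fresh audio, keep only the newest 240ms, run the VAD on it."""
    global buffer
    buffer = torch.cat([buffer, samples])[-WINDOW:]
    if len(buffer) == WINDOW:
        run_vad(buffer)  # every call gets a standalone 240ms snapshot
```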
-
Hi, Can you please share your audio?
Does the problem occur with the standard provided function?
The VAD is recurrent, and the provided utils are written in such a manner that this is taken into account.
The VAD works well even with chunks of 30-100ms.
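For reference, a minimal streaming sketch using the provided VADIterator util (a sketch, assuming the 8 kHz / 256-sample setup from this thread; the file name is taken from the audio shared later in the thread). The iterator carries the model's recurrent state across calls, so past audio does not need to be re-fed:

```python
import torch

# Load Silero VAD plus its helper utils from torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 8000
vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)

wav = read_audio('bytes_jun15.wav', sampling_rate=SAMPLING_RATE)

window_size_samples = 256  # 32ms per window at 8 kHz
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    # The iterator keeps the recurrent state internally, so each 32ms
    # chunk is interpreted in the context of everything seen before it
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict)  # {'start': ...} or {'end': ...}

vad_iterator.reset_states()  # only needed between independent streams
```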
-
Thanks for the quick reply. Here is the audio file: https://www.dropbox.com/s/0tl2661wpfm26bn/bytes_jun15.wav?dl=0 It sounds like I should not be junking the previous audio every time I add new audio?
-
If this is the same stream, then no. If these audios are different - then yes.
There are utils for streaming as well, if you are after streaming.
If you just use the provided utils (do not forget to set the proper SR), even with the default settings you get:
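(The timestamps printed in the original reply are elided; below is a minimal sketch of the non-streaming call that produces them, assuming the bytes_jun15.wav file shared above:)

```python
import torch

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 8000  # the proper SR for this telephony audio

wav = read_audio('bytes_jun15.wav', sampling_rate=SAMPLING_RATE)

# Whole file at once, default settings
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
print(speech_timestamps)

# Write out only the detected speech so it can be listened to
save_audio('only_speech.wav',
           collect_chunks(speech_timestamps, wav),
           sampling_rate=SAMPLING_RATE)
```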
If you listen to the audio containing only the speech, it picks up the first "hello".
If you use the streaming utils (do not forget about setting the sampling rate):
You will get a similar result:
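(The VADIterator sketch shown after the first reply applies here unchanged: feed fixed 256-sample windows and print the start/end events the iterator yields.)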
-
Just wanted to make sure I'm doing things the right way. In my Twilio audio stream, I am getting chunks of 160 bytes (20ms of audio @ 8000 sample rate) each time. What is the correct way to use VADIterator here to continuously detect speech starts and stops? I'm thinking that because VADIterator recommends a window size of 256 samples for an 8k sample rate, I should chain every new set of 256 bytes (i.e. every 1.6 packets) together, convert to WAV, read it with read_audio, and pass the result into VADIterator. Is this the recommended approach? Should I be storing memory of the previous chunks, or will VADIterator handle that for me?
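For illustration, a hedged sketch of that buffering (`on_twilio_packet` is a hypothetical name, and this assumes each 160-byte payload has already been decoded to a float32 tensor of 160 samples, per the mu-law discussion below). Since VADIterator carries the past context internally, only the not-yet-consumed remainder needs to be stored:

```python
import torch

SAMPLING_RATE = 8000
WINDOW = 256  # samples per VAD window at 8 kHz

buffer = torch.empty(0)  # rolling remainder of decoded float32 samples

def on_twilio_packet(packet_samples: torch.Tensor, vad_iterator) -> None:
    """Accumulate 20ms packets; feed each complete 256-sample window to the VAD."""
    global buffer
    buffer = torch.cat([buffer, packet_samples])
    while len(buffer) >= WINDOW:
        window, buffer = buffer[:WINDOW], buffer[WINDOW:]
        event = vad_iterator(window, return_seconds=True)
        if event:
            print(event)  # {'start': ...} or {'end': ...}
```

Feeding the decoded tensor directly skips the bytes → WAV → read_audio round-trip; the round-trip should yield the same samples, it is just extra work per packet.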
-
@snakers4 I figured out why my conversion from int8 to float32 was wrong -- turns out, I have to first convert from mu-law to linear encoding, and only then do the int16 → float32 conversion.
However, there is still some small difference (in terms of additional noise) when I use this method (right side of screenshot), compared to first converting the bytestring to a WAV file and reading it back.
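A minimal sketch of that decode order, assuming 8-bit mu-law payloads (`audioop` is in the standard library, though deprecated since Python 3.11):

```python
import audioop  # stdlib mu-law codec (deprecated in 3.11, removed in 3.13)

import numpy as np
import torch

def mulaw_bytes_to_float32(payload: bytes) -> torch.Tensor:
    # 1) 8-bit mu-law -> 16-bit linear PCM; skipping this step is what
    #    made the earlier int8 -> float32 conversion come out wrong
    pcm16 = audioop.ulaw2lin(payload, 2)
    # 2) int16 -> float32 in [-1.0, 1.0], the range the model expects
    samples = np.frombuffer(pcm16, dtype=np.int16).astype(np.float32) / 32768.0
    return torch.from_numpy(samples)
```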