-
❓ Questions and Help

Hey there, for some reason I am not able to detect short words like "Hi" or "Hello", though longer phrases are detected fine. I am passing in chunks of 240ms at a time; every 120ms I throw away the first 120ms of data, so that we "crawl" through the audio in 120ms steps. Here are the params I'm using:
Looking at the raw probabilities, each time I say "hello" I see a run of at least 5 consecutive windows with probabilities ≥ 0.9. Since each window is 32ms (256 samples / 8000 Hz), shouldn't that mean there is at least 160ms of speech above the specified threshold? Yet the VAD is returning nothing. Things work consistently well when we pass in larger blocks, e.g. 400ms, but shouldn't it also work with a 240ms block, given the low value of that setting? I've attached the raw probabilities for the audio, as well as a screenshot of the visualizations. Thanks so much in advance for your help!
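A hypothetical reconstruction of the feeding scheme described above (all names invented; note that each VAD call only ever sees an isolated 240ms snapshot, which the replies below address):

```python
import torch

SAMPLING_RATE = 8000
STEP = int(0.120 * SAMPLING_RATE)  # 960 samples  = 120ms hop
WINDOW = 2 * STEP                  # 1920 samples = 240ms buffer

buffer = torch.empty(0)

def on_new_audio(samples: torch.Tensor, run_vad) -> None:
    """Append fresh audio, keep only the newest 240ms, run the VAD on it."""
    global buffer
    buffer = torch.cat([buffer, samples])[-WINDOW:]
    if len(buffer) == WINDOW:
        run_vad(buffer)  # every call gets a standalone 240ms snapshot
```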
-
Hi, Can you please share your audio?
Does the problem occur with the standard provided function?
The VAD is recurrent, and the provided utils are written in such a manner that this is taken into account.
The VAD works well even with chunks of 30-100ms.
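For reference, a minimal streaming sketch using the provided VADIterator util (a sketch, assuming the 8 kHz / 256-sample setup from this thread; the file name is taken from the audio shared later in the thread). The iterator carries the model's recurrent state across calls, so past audio does not need to be re-fed:

```python
import torch

# Load Silero VAD plus its helper utils from torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 8000
vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)

wav = read_audio('bytes_jun15.wav', sampling_rate=SAMPLING_RATE)

window_size_samples = 256  # 32ms per window at 8 kHz
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    # The iterator keeps the recurrent state internally, so each 32ms
    # chunk is interpreted in the context of everything seen before it
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict)  # {'start': ...} or {'end': ...}

vad_iterator.reset_states()  # only needed between independent streams
```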
-
Thanks for the quick reply. Here is the audio file: https://www.dropbox.com/s/0tl2661wpfm26bn/bytes_jun15.wav?dl=0 It sounds like I should not be junking the previous audio every time I add new audio?
-
If this is the same stream, then no. If these audios are different - then yes.
There are utils for streaming as well, if you are after streaming.
If you just use the provided utils (do not forget to set the proper SR), even with the default settings you get:
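(The timestamps printed in the original reply are elided; below is a minimal sketch of the non-streaming call that produces them, assuming the bytes_jun15.wav file shared above:)

```python
import torch

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 8000  # the proper SR for this telephony audio

wav = read_audio('bytes_jun15.wav', sampling_rate=SAMPLING_RATE)

# Whole file at once, default settings
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
print(speech_timestamps)

# Write out only the detected speech so it can be listened to
save_audio('only_speech.wav',
           collect_chunks(speech_timestamps, wav),
           sampling_rate=SAMPLING_RATE)
```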
If you listen to the audio containing only the speech, it picks up the first "hello".
If you use the streaming utils (do not forget about setting the sampling rate):
You will get a similar result:
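(The VADIterator sketch shown after the first reply applies here unchanged: feed fixed 256-sample windows and print the start/end events the iterator yields.)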
-
Just wanted to make sure I'm doing things the right way. In my Twilio audio stream, I am getting chunks of 160 bytes (20ms of audio @ 8000 sample rate) each time. What is the correct way to use VADIterator here to continuously detect speech starts and stops? I'm thinking that because VADIterator recommends a window size of 256 samples for an 8k sample rate, I should chain every new set of 256 bytes (i.e. every 1.6 packets) together, convert to WAV, read it with read_audio, and pass the result into VADIterator. Is this the recommended approach? Should I be storing memory of the previous chunks, or will VADIterator handle that for me?
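For illustration, a hedged sketch of that buffering (`on_twilio_packet` is a hypothetical name, and this assumes each 160-byte payload has already been decoded to a float32 tensor of 160 samples, per the mu-law discussion below). Since VADIterator carries the past context internally, only the not-yet-consumed remainder needs to be stored:

```python
import torch

SAMPLING_RATE = 8000
WINDOW = 256  # samples per VAD window at 8 kHz

buffer = torch.empty(0)  # rolling remainder of decoded float32 samples

def on_twilio_packet(packet_samples: torch.Tensor, vad_iterator) -> None:
    """Accumulate 20ms packets; feed each complete 256-sample window to the VAD."""
    global buffer
    buffer = torch.cat([buffer, packet_samples])
    while len(buffer) >= WINDOW:
        window, buffer = buffer[:WINDOW], buffer[WINDOW:]
        event = vad_iterator(window, return_seconds=True)
        if event:
            print(event)  # {'start': ...} or {'end': ...}
```

Feeding the decoded tensor directly skips the bytes → WAV → read_audio round-trip; the round-trip should yield the same samples, it is just extra work per packet.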
-
@snakers4 I figured out why my conversion from int8 to float32 was wrong -- turns out, I have to first convert from mu-law to linear encoding, and only then do the int16 → float32 conversion.
However, there is still some small difference (in terms of additional noise) when I use this method (right side of screenshot), compared to first converting the bytestring to a WAV file and reading it back.
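A minimal sketch of that decode order, assuming 8-bit mu-law payloads (`audioop` is in the standard library, though deprecated since Python 3.11):

```python
import audioop  # stdlib mu-law codec (deprecated in 3.11, removed in 3.13)

import numpy as np
import torch

def mulaw_bytes_to_float32(payload: bytes) -> torch.Tensor:
    # 1) 8-bit mu-law -> 16-bit linear PCM; skipping this step is what
    #    made the earlier int8 -> float32 conversion come out wrong
    pcm16 = audioop.ulaw2lin(payload, 2)
    # 2) int16 -> float32 in [-1.0, 1.0], the range the model expects
    samples = np.frombuffer(pcm16, dtype=np.int16).astype(np.float32) / 32768.0
    return torch.from_numpy(samples)
```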