Chunk time vs number of sample inconsistency. #211

Rodolfo-S · 2022-08-08T02:41:12Z


Why does the number of samples for each chunk size differ from the time in milliseconds on the same chart? On the "Performance Metric" page, the supported chunk sizes are listed as 30, 60, and 100 milliseconds, which at a 16 kHz sampling rate should be 480, 960, and 1600 samples. However, the sample counts given in parentheses in that same column are 512, 1024, and 1536, which correspond to 32, 64, and 96 ms, respectively.

Now I'm wondering which one is incorrect. Should I be feeding the model 480 samples at a time or 512 samples?
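For reference, the arithmetic behind the mismatch can be checked with a short sketch. The helper names `ms_to_samples` and `samples_to_ms` are hypothetical and are not part of silero-vad; they only restate the conversion between duration and sample count.

```python
def ms_to_samples(duration_ms: float, sampling_rate: int) -> int:
    """Number of samples covering duration_ms at sampling_rate (Hz)."""
    return round(duration_ms * sampling_rate / 1000)


def samples_to_ms(num_samples: int, sampling_rate: int) -> float:
    """Duration in milliseconds of num_samples at sampling_rate (Hz)."""
    return num_samples * 1000 / sampling_rate


SAMPLING_RATE = 16000

# Nominal chunk durations from the chart, converted to samples.
print([ms_to_samples(ms, SAMPLING_RATE) for ms in (30, 60, 100)])
# -> [480, 960, 1600]

# Sample counts from the chart, converted back to milliseconds.
print([samples_to_ms(n, SAMPLING_RATE) for n in (512, 1024, 1536)])
# -> [32.0, 64.0, 96.0]
```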


snakers4 (Maintainer) · 2022-08-08T04:05:32Z

Hi @Rodolfo-S,

Please use the sample numbers from the docstring (silero-vad/utils_vad.py, lines 131 to 165 in 7c671a7):

    This method is used for splitting long audios into speech chunks using silero VAD

    Parameters
    ----------
    audio: torch.Tensor, one dimensional
        One dimensional float torch.Tensor, other types are casted to torch if possible

    model: preloaded .jit silero VAD model

    threshold: float (default - 0.5)
        Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
        It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.

    sampling_rate: int (default - 16000)
        Currently silero VAD models support 8000 and 16000 sample rates

    min_speech_duration_ms: int (default - 250 milliseconds)
        Final speech chunks shorter min_speech_duration_ms are thrown out

    min_silence_duration_ms: int (default - 100 milliseconds)
        In the end of each speech chunk wait for min_silence_duration_ms before separating it

    window_size_samples: int (default - 1536 samples)
        Audio chunks of window_size_samples size are fed to the silero VAD model.
        WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
        Values other than these may affect model perfomance!!

    speech_pad_ms: int (default - 30 milliseconds)
        Final speech chunks are padded by speech_pad_ms each side

    return_seconds: bool (default - False)
        whether return timestamps in seconds (default - samples)

    visualize_probs: bool (default - False)
        whether draw prob hist or not

> Why does the number of samples for each chunk size differ from the time in milliseconds on the same chart? On the "Performance Metric" page, the supported chunk sizes are listed as 30, 60, and 100 milliseconds, which at a 16 kHz sampling rate should be 480, 960, and 1600 samples. However, the sample counts given in parentheses in that same column are 512, 1024, and 1536, which correspond to 32, 64, and 96 ms, respectively.

We just rounded the millisecond values for simplicity of presentation; the sample counts are the exact ones.
We used to have more exact numbers of samples, but after model optimization the chunk size had to become a multiple of 2**N.
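As a rough illustration of feeding the model chunks of a supported size, here is a minimal sketch. `iter_windows` and `SUPPORTED_WINDOWS` are hypothetical names, not part of silero-vad, and zero-padding the last partial chunk is an assumption of this sketch. The window sizes themselves are the ones quoted in the docstring above.

```python
from typing import Iterator, List, Sequence

# Window sizes quoted in the docstring, keyed by sampling rate in Hz.
SUPPORTED_WINDOWS = {
    16000: (512, 1024, 1536),
    8000: (256, 512, 768),
}


def iter_windows(
    audio: Sequence[float],
    window_size_samples: int = 512,
    sampling_rate: int = 16000,
) -> Iterator[List[float]]:
    """Yield fixed-size chunks of `audio`, rejecting unsupported window sizes."""
    if sampling_rate not in SUPPORTED_WINDOWS:
        raise ValueError(f"unsupported sampling rate: {sampling_rate}")
    if window_size_samples not in SUPPORTED_WINDOWS[sampling_rate]:
        raise ValueError(
            f"window_size_samples={window_size_samples} is not one of "
            f"{SUPPORTED_WINDOWS[sampling_rate]} for {sampling_rate} Hz"
        )
    for start in range(0, len(audio), window_size_samples):
        chunk = list(audio[start:start + window_size_samples])
        # Assumption of this sketch: zero-pad the final partial chunk.
        chunk.extend([0.0] * (window_size_samples - len(chunk)))
        yield chunk


# 1000 samples at 16 kHz split into 512-sample windows gives two chunks.
windows = list(iter_windows([0.1] * 1000, window_size_samples=512))
print(len(windows), len(windows[0]), len(windows[1]))
```

Passing 480 here raises `ValueError`, which matches the advice above: use 512 samples per chunk rather than 480.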
