Is there any function to detect speech and silence in Silero-Vad? #201

harshraj3223 · 2022-06-27T19:43:17Z

Hey there,
As you know, we use the method 'is_speech()' under 'webrtcvad' to detect if a specified interval is silent or has speech. Do we have anything similar to 'is_speech()' in 'Silero-vad' as well? I mean something that returns a bool value for a provided chunk? Please provide the method name, if available, or its suitable alternative.

Answered by snakers4

Jun 28, 2022

snakers4 (Maintainer) · 2022-06-28T02:14:02Z

Hi,

You can just rewrite this function to return a single bool value indicating whether speech was detected:

silero-vad/utils_vad.py

Lines 119 to 130 in 7c671a7

    
    def get_speech_timestamps(audio: torch.Tensor,
                              model,
                              threshold: float = 0.5,
                              sampling_rate: int = 16000,
                              min_speech_duration_ms: int = 250,
                              min_silence_duration_ms: int = 100,
                              window_size_samples: int = 1536,
                              speech_pad_ms: int = 30,
                              return_seconds: bool = False,
                              visualize_probs: bool = False):
        """

If you do not want to mess with this function, you can just write a wrapper that provides a True value if there is more speech than the specified threshold.
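A minimal sketch of such a wrapper, assuming get_speech_timestamps is called with return_seconds=False so that it returns a list of dicts with 'start' and 'end' sample offsets. The name contains_speech and the min_speech_ratio parameter are hypothetical and are not part of silero-vad:

```python
from typing import Dict, List


def contains_speech(speech_timestamps: List[Dict[str, int]],
                    total_samples: int,
                    min_speech_ratio: float = 0.0) -> bool:
    """Return True if the detected speech covers more than min_speech_ratio of the audio.

    speech_timestamps is assumed to be the output of get_speech_timestamps
    (sample offsets, not seconds). min_speech_ratio is a hypothetical knob:
    0.0 means "any detected speech counts".
    """
    if total_samples <= 0:
        return False
    # Sum the length of every detected speech segment, in samples.
    speech_samples = sum(seg["end"] - seg["start"] for seg in speech_timestamps)
    return speech_samples / total_samples > min_speech_ratio


# Possible usage (audio and model loaded as in the silero-vad README):
#   timestamps = get_speech_timestamps(audio, model, sampling_rate=16000)
#   has_speech = contains_speech(timestamps, len(audio))
```

This keeps the library function untouched, so its smoothing parameters (min_speech_duration_ms, min_silence_duration_ms) still apply before the bool is computed.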

Also note that the VAD itself returns a speech probability for a given audio chunk, which you can compare against a threshold to get a bool value:

silero-vad/utils_vad.py

Line 210 in 7c671a7

speech_prob = model(chunk, sampling_rate).item()
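For a webrtcvad-style check, the probability from the line above can be compared against a threshold. This is only a sketch: the is_speech name is hypothetical (silero-vad does not ship such a function), and the default of 0.5 simply mirrors the threshold default in get_speech_timestamps.

```python
def is_speech(model, chunk, sampling_rate: int = 16000, threshold: float = 0.5) -> bool:
    """Return True if the model's speech probability for this chunk reaches the threshold.

    model is assumed to be a loaded silero-vad model (any callable whose
    result exposes .item() works), and chunk a 1-D float tensor holding
    one window of audio samples.
    """
    speech_prob = model(chunk, sampling_rate).item()
    return speech_prob >= threshold
```

Unlike get_speech_timestamps, a per-chunk check like this applies no minimum speech or silence duration, so the result can flicker between True and False on short noises or pauses.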

harshraj3223 (Author) · Jun 28, 2022

Thanks a lot for the help!
So, I'm going to use speech_prob = model(chunk, sampling_rate).item() to decide whether a given chunk is voiced or silent. When you have time, could you please answer the following questions? It would be a great help!

(i) Can you please tell me all the arguments or parameters for this 'model()' function except for 'chunk' and 'sampling_rate'?

(ii) What would be the most efficient value for the threshold if I'm taking an audio stream from the microphone?

(iii) In the model(chunk, 16000) function, is it necessary to have the chunk value lie only in {512, 1024, 1536} ? If yes, what can be done if I want the model to wait for at least 1 second of silence/pause before turning the speech into a new line? (Context: I'm working on a speech-to-text model)

(iv) When I'm using Silero-VAD in my speech-to-text model, it doesn't seem to detect single words like 'Hey' or 'to', which are said in a short time interval. Also, when speaking a whole hefty sentence, it often fails to detect the beginning word of the sentence. Can you please guide me through this?

snakers4 (Maintainer) · 2022-06-28T09:33:01Z

(i) Can you please tell me all the arguments or parameters for this 'model()' function except for 'chunk' and 'sampling_rate'?

The necessary parameters are listed in the above function.

(ii) What would be the most efficient value for the threshold if I'm taking an audio stream from the microphone?

Depends on the application.
It can be easily tuned with the provided visualize_probs flag.

(iii) In the model(chunk, 16000) function, is it necessary to have the chunk value lie only in {512, 1024, 1536} ?

silero-vad/utils_vad.py

Lines 153 to 156 in 7c671a7

    
    window_size_samples: int (default - 1536 samples)
        Audio chunks of window_size_samples size are fed to the silero VAD model.
        WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
        Values other than these may affect model perfomance!!
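The window sizes from the docstring above can be checked before feeding chunks to the model. A small sketch follows; the helper names are hypothetical, and the table only encodes the values quoted in the warning:

```python
# Window sizes (in samples) the docstring says the models were trained with,
# keyed by sampling rate in Hz.
SUPPORTED_WINDOW_SIZES = {
    16000: (512, 1024, 1536),
    8000: (256, 512, 768),
}


def is_supported_window(window_size_samples: int, sampling_rate: int = 16000) -> bool:
    """Return True if this window size is one of the trained sizes for the rate."""
    return window_size_samples in SUPPORTED_WINDOW_SIZES.get(sampling_rate, ())


def window_duration_ms(window_size_samples: int, sampling_rate: int = 16000) -> float:
    """Duration of one window in milliseconds."""
    return 1000.0 * window_size_samples / sampling_rate
```

At 16000 Hz the three sizes correspond to 32 ms, 64 ms and 96 ms per chunk, so a pause of one second spans many chunks rather than one large chunk.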

If yes, what can be done if I want the model to wait for at least 1 second of silence/pause before turning the speech into a new line? (Context: I'm working on a speech-to-text model)

Do not pass any chunks or start a new VAD session.
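One possible way to get the "wait for 1 second of pause" behaviour while keeping the chunk size at a supported value is to count consecutive non-speech chunks in your own code. The sketch below is an assumption about how this could be done, not the maintainer's method; find_line_breaks and min_pause_ms are hypothetical names, and the per-chunk bool flags are assumed to come from thresholding the model output.

```python
from typing import Iterable, List


def find_line_breaks(speech_flags: Iterable[bool],
                     window_size_samples: int = 512,
                     sampling_rate: int = 16000,
                     min_pause_ms: float = 1000.0) -> List[int]:
    """Return the chunk indices at which a pause of at least min_pause_ms completes.

    speech_flags holds one bool per chunk (True = speech). A break is only
    reported after some speech has been seen, so leading silence is ignored.
    """
    chunk_ms = 1000.0 * window_size_samples / sampling_rate
    breaks: List[int] = []
    silence_ms = 0.0
    in_utterance = False
    for index, flag in enumerate(speech_flags):
        if flag:
            # Speech resets the silence counter and opens (or continues) an utterance.
            in_utterance = True
            silence_ms = 0.0
        elif in_utterance:
            silence_ms += chunk_ms
            if silence_ms >= min_pause_ms:
                # Enough trailing silence: close the utterance here.
                breaks.append(index)
                in_utterance = False
                silence_ms = 0.0
    return breaks
```

With 512-sample chunks at 16000 Hz each chunk lasts 32 ms, so a break is reported on the 32nd consecutive silent chunk after speech. In a streaming setup, that would be the point at which to start a new line in the transcript.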

(iv) When I'm using Silero-VAD in my speech-to-text model, it doesn't seem to detect single words like 'Hey' or 'to', which are said in a short time interval. Also, when speaking a whole hefty sentence, it often fails to detect the beginning word of the sentence. Can you please guide me through this?

Most likely you need to tune some of these parameters; see the visualize_probs flag:

silero-vad/utils_vad.py

Lines 121 to 125 in 7c671a7

    
    threshold: float = 0.5,
    sampling_rate: int = 16000,
    min_speech_duration_ms: int = 250,
    min_silence_duration_ms: int = 100,
    window_size_samples: int = 1536,
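As a rough illustration of such tuning for short words and clipped sentence starts, the parameters could be bundled and passed to get_speech_timestamps. The values below are untested starting points chosen for illustration, not recommendations from the maintainer, and the helper name is hypothetical. The function is passed in as an argument only to keep the sketch self-contained.

```python
# Illustrative starting points only (assumptions, not recommended values):
# a lower threshold and shorter minimum speech duration let brief words
# through, and more padding keeps the onset of the first word.
SHORT_WORD_PARAMS = {
    "threshold": 0.35,
    "min_speech_duration_ms": 100,
    "speech_pad_ms": 100,
    "window_size_samples": 512,
}


def timestamps_with_overrides(get_speech_timestamps, audio, model, **overrides):
    """Call get_speech_timestamps with SHORT_WORD_PARAMS, letting callers override any value."""
    params = {**SHORT_WORD_PARAMS, **overrides}
    return get_speech_timestamps(audio, model, **params)


# Possible usage with the real function from silero-vad's utils:
#   timestamps = timestamps_with_overrides(get_speech_timestamps, audio, model,
#                                          sampling_rate=16000)
```

Lowering threshold and min_speech_duration_ms also raises the chance of false positives on noise, so it is worth checking the effect with visualize_probs=True on your own recordings.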
