Voice Activity Detection #1468

Closed
vincentqb opened this issue Apr 21, 2021 · 7 comments

Comments

@vincentqb
Contributor

I'd love to hear from our users what impact their favorite VAD algorithm has on their model's performance, at both training and inference (offline, online, streaming).

  • Is there one we should add in torchaudio?
  • Does anyone use webrtc's VAD implementation mentioned in comment?

We currently have in torchaudio

cc @mthrok @astaff

@PetrochukM

PetrochukM commented Apr 21, 2021

Voice Activity Detection has been important for my work on TTS and STT. It's useful for segmenting large audio files before training. I'd love a basic VAD implementation akin to:
https://maelfabien.github.io/project/Speech_proj/#pros-and-cons

webrtc didn't work for me because it didn't expose enough parameters for tuning. The default version of webrtc didn't work well for clean datasets... It tended to overcorrect for noise. It'd cut off unvoiced consonants because it thought they were noise... I tried all of the available settings in py-webrtcvad to correct for this.

I don't think there is one perfect algorithm for VAD because you need to make all sorts of assumptions about the SNR. So, I prefer a VAD which can be tuned by hand for different situations: no noise, minimal background noise, medium background noise, etc... and for different speaker levels: whisper, conversational, speech, etc...

The one currently in torchaudio didn't work for me because it is focused on audio trimming instead of detecting voice throughout the audio.
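
To make the "tuned by hand" idea concrete, here is a minimal sketch of an energy-based VAD that labels every frame of the audio rather than only trimming the ends. It is not the algorithm from the link above and not anything in torchaudio; the function names (`frame_rms_db`, `energy_vad`, `speech_segments`) and the default values for `threshold_db` and `hangover_frames` are assumptions made up for illustration. The threshold is the knob you would move for different noise and speaker levels, and the hangover keeps a few frames after the level drops so that quiet trailing sounds are less likely to be clipped.

```python
import math


def frame_rms_db(samples, frame_len):
    """Return the level in dBFS of each non-overlapping frame.

    `samples` is a sequence of floats in [-1, 1]; a trailing partial frame is dropped.
    """
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on digital silence.
        levels.append(10.0 * math.log10(mean_sq + 1e-12))
    return levels


def energy_vad(samples, sample_rate, frame_ms=20, threshold_db=-40.0, hangover_frames=5):
    """Return one boolean per frame: True where the frame is treated as speech.

    threshold_db and hangover_frames are hypothetical defaults, meant to be tuned by hand.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = []
    hang = 0
    for level in frame_rms_db(samples, frame_len):
        if level > threshold_db:
            hang = hangover_frames  # re-arm the hangover on every loud frame
            flags.append(True)
        elif hang > 0:
            hang -= 1               # keep a few frames after the level drops
            flags.append(True)
        else:
            flags.append(False)
    return flags


def speech_segments(flags, frame_ms=20):
    """Merge consecutive speech frames into (start_sec, end_sec) tuples."""
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, len(flags) * frame_ms / 1000))
    return segments
```

For a clean recording you might raise `threshold_db` and shorten the hangover; for whispered speech you would likely lower the threshold. Those choices depend on the data, so treat the defaults as placeholders.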

@vincentqb
Contributor Author

vincentqb commented Apr 21, 2021

great response @PetrochukM and thanks a lot for the input :) have you also tried this one?

@vincentqb
Contributor Author

(quick note: some elements of the algorithm you suggested are also similar to one of the pitch detection algorithms we have)

@PetrochukM

Oh. I didn't. Thanks for the tip! It looks pretty close to what I was describing :)

Also, another issue that we had with VAD was memory. For some reason, scipy doesn't support memmap. This made it particularly difficult to work with files longer than 1 hour.

I actually tried to use the pitch detection algorithm in torchaudio but it was just too slow compared to other algorithms. I don't have too many details right now but it was much slower than something like loudness detection. I can give more details, in another thread, sometime later!
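
On the memory point above, one way to avoid loading a long file at once is to read it a frame at a time. The sketch below uses Python's standard `wave` module rather than scipy, and it only computes a per-frame loudness level in dBFS; it assumes 16-bit mono PCM input, and the function name `stream_frame_levels` is made up for illustration.

```python
import array
import math
import sys
import wave


def stream_frame_levels(path, frame_ms=20):
    """Yield (start_sec, level_db) per frame, holding only one frame in memory.

    Assumes a 16-bit mono PCM WAV file; a trailing partial frame is dropped.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch assumes 16-bit mono PCM")
        frame_len = int(wav.getframerate() * frame_ms / 1000)
        index = 0
        while True:
            raw = wav.readframes(frame_len)
            if len(raw) < frame_len * 2:
                break
            pcm = array.array("h")
            pcm.frombytes(raw)
            if sys.byteorder == "big":
                pcm.byteswap()  # WAV samples are stored little-endian
            # Normalise by full scale so the level is in dBFS.
            mean_sq = sum(s * s for s in pcm) / (frame_len * 32768.0 ** 2)
            yield index * frame_ms / 1000, 10.0 * math.log10(mean_sq + 1e-12)
            index += 1
```

Because the function is a generator, the levels can be thresholded as they arrive, so memory use stays roughly constant regardless of file length.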

@vincentqb
Contributor Author

@PetrochukM, have you tried the Kaldi voice activity detection? It is also mentioned in the PyTorch forum.

@PetrochukM

PetrochukM commented Jun 10, 2021

I considered it but I decided against it during my research. I vaguely remember being overwhelmed by the complexity of it. I hope my POV helps!

@vincentqb
Contributor Author

I hope my POV helps!

Definitely, thank you :)

@mthrok mthrok closed this as completed Jul 22, 2021