Please share your improvement ideas! #16

snakers4 · 2021-01-19T07:01:07Z

snakers4
Jan 19, 2021
Maintainer

Please share ideas on how / what to improve.

For example, some obvious ones:

Add music detection
Add speaker change detection
Add more languages (which ones, would you like to contribute some data?)

dophist · 2021-01-24T07:03:37Z

dophist
Jan 24, 2021

Thanks for the effort, Alex, great project!

Regarding VAD, I'm wondering whether there is any plan to release more sophisticated top-level segmentation logic beyond the frame-level pretrained model.

Say, 4 commonly used options are:

min-sil-duration: to smooth out "sentence-internal" short pause
max-sil-duration: so called "time-out" in streaming processing
min-speech-duration: to smooth out non-stationary "spiking" voice-like noises
max-speech-duration: to set hard-length-limit of segmentation

I see the current code does have a ring-buffer cache (for top-level average smoothing). This is definitely a place that could be improved to support more complicated segmentation strategies and controls. I think these would make silero-vad friendlier to end users without a speech background.

Besides that, supporting multimedia input formats other than well-formed 16 kHz, 16-bit PCM/WAV (say, mp3 or even mp4/mkv) would help other projects use silero-vad as a batteries-included module.
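As a rough illustration of the idea (not something silero-vad ships), arbitrary media could be decoded to 16 kHz mono 16-bit PCM by shelling out to the `ffmpeg` CLI. The function names `ffmpeg_pcm_command` and `decode_to_pcm16` are hypothetical, and the sketch assumes `ffmpeg` is installed and on `PATH`.

```python
import subprocess
import sys
from array import array
from typing import List


def ffmpeg_pcm_command(path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes raw mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,               # any container/codec ffmpeg can read
        "-vn",                    # ignore video streams (mp4/mkv)
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout
    ]


def decode_to_pcm16(path: str, sample_rate: int = 16000) -> array:
    """Decode a media file into an array of int16 samples (requires ffmpeg)."""
    result = subprocess.run(
        ffmpeg_pcm_command(path, sample_rate), check=True, capture_output=True
    )
    samples = array("h")
    samples.frombytes(result.stdout)
    if sys.byteorder == "big":
        samples.byteswap()  # ffmpeg wrote little-endian data
    return samples
```

The resulting samples would still need to be converted to whatever tensor type the model expects.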

Finally, thanks again for sharing silero-vad :)

6 replies

dophist Jan 24, 2021

min-sil-duration: to smooth out "sentence-internal" short pause
max-sil-duration: so called "time-out" in streaming processing
min-speech-duration: to smooth out non-stationary "spiking" voice-like noises
max-speech-duration: to set hard-length-limit of segmentation

Could you please elaborate in detail on how each of these strategies works and what benefits it brings?
I tried some naïve searches but to no avail.

Actually, you already have some code dealing with one of these scenarios. Take https://github.com/snakers4/silero-vad/blob/master/utils.py#L100 for example: it acts like "min-speech-duration" among the four options I listed above.

Most VAD engines embed these kinds of strategies inside a "state-machine-like" post-processing stage, which is functionally equivalent to, but more complex than, your caching buffer here: https://github.com/snakers4/silero-vad/blob/master/utils.py#L87. In practice, these settings are exposed to users (for a legitimate reason) to deal with different speaking rates, disfluency, strong music interference, etc.
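To make the post-processing idea concrete, here is a minimal sketch of how "min-sil-duration" and "min-speech-duration" could be applied on top of frame-level probabilities. It is written as simple offline passes rather than a true streaming state machine, it is not the code in `utils.py`, and the name `probs_to_segments` and all default values are assumptions chosen for illustration.

```python
from typing import List, Tuple


def probs_to_segments(
    probs: List[float],
    frame_ms: int = 30,
    threshold: float = 0.5,
    min_speech_ms: int = 90,
    min_silence_ms: int = 60,
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments."""
    # Pass 1: threshold the frames and collect raw runs of speech.
    runs: List[List[int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            runs.append([start * frame_ms, i * frame_ms])
            start = None
    if start is not None:
        runs.append([start * frame_ms, len(probs) * frame_ms])

    # Pass 2 (min-sil-duration): merge runs separated by a too-short pause.
    merged: List[List[int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_silence_ms:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # Pass 3 (min-speech-duration): drop short "spiking" segments.
    return [(s, e) for s, e in merged if e - s >= min_speech_ms]
```

Merging pauses before dropping short segments is one possible ordering; doing it the other way round would discard short bursts that could otherwise have been joined to a neighbour.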

snakers4 Jan 24, 2021
Maintainer Author

Oh, I see now
You do not mean some complex strategies, but just some hardcoded params like:

minimal silence duration
maximal silence duration
minimal speech duration
maximal speech duration

minimal silence duration

is essentially implicitly handled via chunking
i.e. if we work with 100ms or 250ms chunks we make an obvious assumption that silence shorter than this chunk is not meaningful

minimal speech duration

yeah, you are right, this probably should be a param and not hard-coded
@adamnsandle

maximal silence duration
maximal speech duration

well, we do not have this
in real life pauses and speech can be arbitrarily long (from the standpoint of a VAD)
but typically 95% of speech is within 1-7 seconds (which is very long in terms of VAD)
the remaining 5% mostly falls into the 7-15s range
silence can also be arbitrarily long. what we mean by silence can differ, but here we tried to measure the silence ONE person makes when speaking (not when 2 people are taking turns talking - then silence is essentially waiting for the other person to stop talking)

dophist Jan 24, 2021

Thanks for sharing the distribution data, pretty informative.

some more thoughts to share on these:

the significance of "max speech duration" is that it makes the output of VAD bounded. Without it, super-long segments can happen in some cases. For example, if a debate speech segment of 400 secs (which failed to be segmented due to fast speaking speed) is sent to a module downstream of the VAD, say a bi-directional RNN, memory could blow up. Taking one step back, a VAD engine that can bound max speech duration, even when the bound is set loosely, is more robust than one that is theoretically unbounded.
the significance of "maximal silence duration" is due to the fact that VAD engines can be used in streaming applications. Upper-level applications are not supposed to wait indefinitely for the "end signal" of an entire long silence. It is essentially a conditional early-stopping-and-reset for the VAD engine in case of really long silence.

In practice there are more cases that a VAD needs to handle than those listed above, and handling all of them does increase the complexity of the VAD code. VAD is a common pitfall for almost every single speech application I've seen. WebRTC failed to provide a good solution for the open-source domain; I believe Silero-VAD is a good start toward raising publicly available VAD performance to a new level. Cheers.

snakers4 Jan 24, 2021
Maintainer Author

the significance of "max speech duration" is that it makes the output of VAD bounded. Without it, super-long segments can happen in some cases. For example, if a debate speech segment of 400 secs (which failed to be segmented due to fast speaking speed) is sent to a module downstream of the VAD, say a bi-directional RNN, memory could blow up. Taking one step back, a VAD engine that can bound max speech duration, even when the bound is set loosely, is more robust than one that is theoretically unbounded.

yeah, you are totally right here
and what is even funnier - we caught this "bug" with WebRTC in production with fast-talking and constantly interrupting radio hosts - it is literally an endless stream of consciousness

the only problem is that unless some meta-strategy helps (i.e. try to segment as-is, then lower your thresholds and try again, lower the parameters controlling unnaturally short pauses, e.g. sighs, repeat, maybe recursively) you essentially have to just cut the audio into N-second chunks and call it a day

obviously this may introduce artefacts into downstream STT pipeline, just imagine any long word being split in two

basically you are likely to lose 2-4 words because of this. I know of no good remedy except running STT twice on such chunks, i.e. once to get word timestamps, and then a second time after re-cutting the audio

but here it becomes interesting - since we provide a toolkit, we should not assume we know much about the user's use case - maybe they just want to suppress silence and write non-silence into a file. if we add some arbitrary "forced interruption" token this may contradict the philosophy of the barebones toolkit.

we cannot say that there is no speech there (because there is). probably, provided the user sets this param, we can just output 2 chunks immediately one after another instead of one. for a proper solution you need full STT
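A minimal sketch of the approach discussed above: retry segmentation with progressively stricter pause settings, and only when that fails cut the segment into back-to-back chunks. Nothing here is existing silero-vad API; `hard_split`, `bound_segments` and the `resegment` callback (which stands in for re-running the VAD on one segment with a given minimal-silence setting) are hypothetical names.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s)


def hard_split(seg: Segment, max_s: float) -> List[Segment]:
    """Cut one segment into back-to-back pieces no longer than max_s."""
    start, end = seg
    pieces = []
    while end - start > max_s:
        pieces.append((start, start + max_s))
        start += max_s
    pieces.append((start, end))
    return pieces


def bound_segments(
    seg: Segment,
    resegment: Callable[[Segment, float], List[Segment]],
    max_s: float,
    min_silences_s: List[float],
) -> List[Segment]:
    """Return pieces of seg that are each at most max_s long.

    min_silences_s lists the pause settings to try, in order; once they are
    exhausted the segment is simply cut into fixed-length chunks.
    """
    start, end = seg
    if end - start <= max_s:
        return [seg]
    if not min_silences_s:
        return hard_split(seg, max_s)  # last resort: may cut a word in two
    # If the stricter pass finds nothing, keep the segment rather than lose it.
    parts = resegment(seg, min_silences_s[0]) or [seg]
    out: List[Segment] = []
    for part in parts:
        out.extend(bound_segments(part, resegment, max_s, min_silences_s[1:]))
    return out
```

Because the hard split emits adjacent chunks with no gap, no speech is declared missing; the downstream consumer just sees two or more segments instead of one.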

@adamnsandle seems also like a good zero-cost quality of life improvement for tinkerers

the significance of "maximal silence duration" is due to the fact that VAD engines can be used in streaming applications. Upper-level applications are not supposed to wait indefinitely for the "end signal" of an entire long silence. It is essentially a conditional early-stopping-and-reset for the VAD engine in case of really long silence.

this is also a valid point and actually we are designing a proper interface for our EE client, but the key is tight integration between STT and VAD
but this looks more like an interface thing, i.e. this should be implemented in the end client / server where the VAD is running, the VAD itself should be "dumb"
@Islanna please think if you can off-load some of your logic here in the utils maybe?
or maybe just share a final streaming gRPC service, I believe this piece of logic belongs there more
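For illustration, such a timeout could live entirely in the client or server code while the VAD stays "dumb" and only answers speech / no speech per chunk. This is a minimal sketch under that assumption; `endpoint_events`, the event names and the default values are made up for the example and are not part of silero-vad or of any gRPC service mentioned here.

```python
from typing import Iterable, Iterator, Tuple


def endpoint_events(
    chunk_is_speech: Iterable[bool],
    chunk_ms: int = 250,
    max_silence_ms: int = 1000,
) -> Iterator[Tuple[int, str]]:
    """Yield (time_ms, event) pairs from a stream of per-chunk VAD decisions.

    Events: "speech_start", "speech_end", and "timeout" when silence has
    lasted max_silence_ms, so the caller never waits indefinitely.
    """
    in_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(chunk_is_speech):
        chunk_start = i * chunk_ms
        if is_speech:
            if not in_speech:
                yield (chunk_start, "speech_start")
                in_speech = True
            silence_ms = 0
        else:
            if in_speech:
                yield (chunk_start, "speech_end")
                in_speech = False
            silence_ms += chunk_ms
            if silence_ms >= max_silence_ms:
                # Report at the end of the current chunk, then reset the timer.
                yield (chunk_start + chunk_ms, "timeout")
                silence_ms = 0
```

The caller decides what a timeout means (close the stream, flush to STT, reset state), which keeps that policy out of the VAD itself.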

snakers4 Jan 24, 2021
Maintainer Author

handling all of them does increase the complexity of the VAD code

I cannot emphasize enough though that we should not over-think everything for a user because the end users hopefully will be using our VAD as a black box in their environment of choice (not only Python, but maybe Java or NodeJS, or whatever works with ONNX) kind of building upon our barebones examples

A careful separation of concerns should be kept, because adding one if statement is 100x easier than unravelling glitchy bloated third-party code
