[MM-54242] Improve timestamp accuracy #3

Merged: 4 commits merged into MM-53934 from MM-54242 on Nov 16, 2023

streamer45
Contributor

Summary

This PR adds a speech detection step before transcription. It is needed because whisper.cpp doesn't currently support word-level timestamps (see ggerganov/whisper.cpp#375), so stretches of silence could skew the sync of the final transcription by quite a bit.

It works by first feeding the decoded audio samples to the Silero VAD model, which effectively trims silence from the track and splits it into separate speech segments. Those segments are then passed to the Whisper engine for transcription.
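To illustrate why splitting on speech keeps the transcript in sync, here is a minimal sketch of the re-anchoring step. It assumes the transcriber reports timestamps relative to the start of each trimmed segment. The `SpeechSegment`, `Line`, and `Reanchor` names are hypothetical and are not taken from this PR or from the Silero VAD or whisper.cpp bindings.

```go
package main

import "fmt"

// SpeechSegment is a hypothetical stand-in for a speech detector result.
// Both fields are in seconds, measured from the start of the track.
type SpeechSegment struct {
	StartAt float64
	EndAt   float64
}

// Line is a hypothetical transcription result. Start and End are in seconds,
// measured from the start of the segment that was transcribed.
type Line struct {
	Start, End float64
	Text       string
}

// Reanchor shifts segment-relative timestamps back onto the track timeline
// by adding the segment's start time. End times are clamped to the segment end.
func Reanchor(seg SpeechSegment, lines []Line) []Line {
	out := make([]Line, 0, len(lines))
	for _, l := range lines {
		l.Start += seg.StartAt
		l.End += seg.StartAt
		if l.End > seg.EndAt {
			l.End = seg.EndAt
		}
		out = append(out, l)
	}
	return out
}

func main() {
	// Speech detected from 12.5s to 15.0s; everything before it was silence.
	seg := SpeechSegment{StartAt: 12.5, EndAt: 15.0}
	lines := []Line{{Start: 0.0, End: 1.5, Text: "hello"}}

	for _, l := range Reanchor(seg, lines) {
		fmt.Printf("%.1f-%.1f %s\n", l.Start, l.End, l.Text)
	}
}
```

Because each segment carries its own start time, any timestamp drift is bounded by the length of a single speech segment rather than accumulating across the silent parts of the track.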

@jupenur Apologies for the late addition of a non-trivial dependency. Let me know if you have any questions.

Ticket Link

https://mattermost.atlassian.net/browse/MM-54242

streamer45 added the "2: Dev Review" and "3: Security Review" labels on Nov 10, 2023
streamer45 added this to the v0.1.0 milestone on Nov 10, 2023
streamer45 self-assigned this on Nov 10, 2023
Member

@cpoile cpoile left a comment

Awesome (as usual)! :)

Comment on lines +352 to +353
startSampleOff := int(seg.SpeechStartAt * trackOutAudioRate)
endSampleOff := int(seg.SpeechEndAt * trackOutAudioRate)
Member

One request -- can you add some comments for future us about what the units are for seg.SpeechStartAt and startSampleOff? (That will help in the future when I'm reviewing and remembering what the numbers mean.)

Contributor Author

Sure, but the sample offset is just what it says: an index to the first sample where the speech starts (± some padding). There's no unit, really.
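For future readers, the conversion discussed in this thread can be sketched as follows. This is an illustration only: it assumes the segment times are in seconds and the rate is in samples per second, so the product is a unitless index into the sample buffer. `SampleRange` and the 16000 Hz rate are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// sampleRate is an assumed value for illustration, in samples per second.
const sampleRate = 16000

// SampleRange converts a time range in seconds into a half-open range of
// sample indices [start, end) for a buffer holding n samples.
// seconds * (samples / second) = samples, so the results are plain indices.
func SampleRange(startSec, endSec float64, rate, n int) (start, end int) {
	start = int(startSec * float64(rate))
	end = int(endSec * float64(rate))

	// Clamp so that padding around a segment can never index out of bounds.
	if start < 0 {
		start = 0
	}
	if end > n {
		end = n
	}
	if start > end {
		start = end
	}
	return start, end
}

func main() {
	samples := make([]float32, 5*sampleRate) // 5 seconds of audio

	start, end := SampleRange(1.5, 3.0, sampleRate, len(samples))
	speech := samples[start:end]

	fmt.Println(start, end, len(speech)) // prints "24000 48000 24000"
}
```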

streamer45 removed the "2: Dev Review" label on Nov 10, 2023
build/prepare_deps.sh (review thread resolved)
* Implement more human friendly filenames

* Include Silero VAD model v4

* Cache CGO dependencies and Whisper models in Docker build

* Update silero-vad-go

* Add SHA check for ONNXRuntime

* Support language autodetection

* Initial multi-threading support

* Fix marshaling case

* Tune speech detector silence duration threshold

* Sanitize Text and Speaker strings

* Build as position-independent executable

* Update rtcd client dependency

* Better escaping
streamer45 added the "3: Reviews Complete" label and removed the "3: Security Review" label on Nov 16, 2023
streamer45 merged commit 92909b8 into MM-53934 on Nov 16, 2023
2 checks passed
streamer45 deleted the MM-54242 branch on November 16, 2023 16:15
3 participants