[MM-54242] Improve timestamp accuracy #3
Conversation
Awesome (as usual)! :)
startSampleOff := int(seg.SpeechStartAt * trackOutAudioRate)
endSampleOff := int(seg.SpeechEndAt * trackOutAudioRate)
One request -- can you add some comments for future us about what the units are for seg.SpeechStartAt and startSampleOff? (That will help in the future when I'm reviewing and remembering what the numbers mean.)
Sure, but the sample offset is just what it says: an index to the first sample where the speech starts (plus or minus some padding), so there's no unit really.
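To make the units discussion concrete, here is a minimal, self-contained sketch of the conversion in the quoted snippet. The field and constant names mirror the snippet, but the struct definition and the 16 kHz rate are assumptions for illustration, not the actual plugin code:

```go
package main

import "fmt"

// segment is a hypothetical stand-in for the VAD output type.
// SpeechStartAt / SpeechEndAt are offsets in seconds from the
// start of the track.
type segment struct {
	SpeechStartAt float64 // seconds
	SpeechEndAt   float64 // seconds
}

// trackOutAudioRate is samples per second; 16000 is assumed here
// since that is the rate Whisper models expect.
const trackOutAudioRate = 16000

// sampleOffsets converts second-based timestamps into plain indices
// into the decoded PCM buffer: seconds * (samples/second) = samples.
func sampleOffsets(seg segment) (start, end int) {
	return int(seg.SpeechStartAt * trackOutAudioRate),
		int(seg.SpeechEndAt * trackOutAudioRate)
}

func main() {
	s, e := sampleOffsets(segment{SpeechStartAt: 1.5, SpeechEndAt: 2.0})
	fmt.Println(s, e) // 24000 32000
}
```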
* Implement more human friendly filenames
* Include Silero VAD model v4
* Cache CGO dependencies and Whisper models in Docker build
* Update silero-vad-go
* Add SHA check for ONNXRuntime
* Support language autodetection
* Initial multi-threading support
* Fix marshaling case
* Tune speech detector silence duration threshold
* Sanitize Text and Speaker strings
* Build as position-independent executable
* Update rtcd client dependency
* Better escaping
Summary
This PR adds a speech detection step prior to transcribing. This was made necessary because whisper.cpp doesn't currently support word-level timestamps (see ggerganov/whisper.cpp#375), so portions of silence could skew the sync of the final transcription by quite a bit.

The way this works is by first feeding the decoded audio samples to the Silero VAD model to essentially trim silence from the track, splitting it into separate speech segments before passing them to the Whisper engine for transcription.
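The trimming step described above can be sketched as follows. This is an illustrative outline, not the actual plugin code: the type and function names are hypothetical, and it assumes each VAD segment has already been converted to start/end sample offsets as in the snippet discussed earlier:

```go
package main

import "fmt"

// speechSegment holds sample-index bounds for one detected speech
// region (names are hypothetical, for illustration only).
type speechSegment struct {
	startSample int
	endSample   int
}

// trimSilence slices the decoded PCM buffer at each segment's bounds,
// returning only the speech portions so silence never reaches the
// transcription engine.
func trimSilence(samples []float32, segs []speechSegment) [][]float32 {
	var out [][]float32
	for _, s := range segs {
		// Clamp to buffer bounds in case padding pushed offsets
		// past the edges of the track.
		start, end := s.startSample, s.endSample
		if start < 0 {
			start = 0
		}
		if end > len(samples) {
			end = len(samples)
		}
		if start >= end {
			continue
		}
		out = append(out, samples[start:end])
	}
	return out
}

func main() {
	pcm := make([]float32, 100)
	chunks := trimSilence(pcm, []speechSegment{{10, 30}, {60, 90}})
	fmt.Println(len(chunks), len(chunks[0]), len(chunks[1])) // 2 20 30
}
```

Each returned chunk would then be transcribed independently, with its original start offset used to shift the resulting timestamps back onto the full-track timeline.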
@jupenur Apologies for the late addition of a non-trivial dependency. Let me know if you have any questions.
Ticket Link
https://mattermost.atlassian.net/browse/MM-54242