[MM-54242] Improve timestamp accuracy #3

streamer45 · 2023-11-10T02:10:50Z

Summary

This PR adds a speech detection step prior to transcribing. It is necessary because whisper.cpp doesn't currently support word-level timestamps (see ggerganov/whisper.cpp#375), so stretches of silence could throw the final transcription noticeably out of sync.

This works by first feeding the decoded audio samples to the Silero VAD model to trim silence from the track, splitting it into separate speech segments that are then passed to the Whisper engine for transcription.
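For reviewers who want the idea in code, here is a minimal sketch of the flow described above. It is not the code in this PR: `speechSegment`, `caption`, `transcribeFn` and `transcribeSpeechOnly` are hypothetical names, the VAD and Whisper calls are stubbed out behind plain Go types, and timestamps are assumed to be in seconds.

```go
package main

// speechSegment is a hypothetical stand-in for one VAD result: the start and
// end of a detected stretch of speech, in seconds from the start of the track.
type speechSegment struct {
	StartSec, EndSec float64
}

// caption is a hypothetical transcription result with timestamps in seconds.
type caption struct {
	StartSec, EndSec float64
	Text             string
}

// transcribeFn stands in for the Whisper call. It is assumed to return
// captions whose timestamps are relative to the samples it was given.
type transcribeFn func(samples []float32) []caption

// transcribeSpeechOnly cuts each speech segment out of the track, transcribes
// it on its own, and shifts the resulting timestamps back onto the track
// timeline, so skipped silence cannot accumulate as drift.
func transcribeSpeechOnly(samples []float32, rate int, segs []speechSegment, transcribe transcribeFn) []caption {
	var out []caption
	for _, seg := range segs {
		// Convert seconds to sample indexes and clamp them to the track.
		start := int(seg.StartSec * float64(rate))
		end := int(seg.EndSec * float64(rate))
		if start < 0 {
			start = 0
		}
		if end > len(samples) {
			end = len(samples)
		}
		if start >= end {
			continue // empty or inverted segment, nothing to transcribe
		}
		for _, c := range transcribe(samples[start:end]) {
			// Re-anchor the caption to the original track timeline.
			c.StartSec += seg.StartSec
			c.EndSec += seg.StartSec
			out = append(out, c)
		}
	}
	return out
}
```

The point of the sketch is the last loop: each caption is anchored to the start of its own speech segment, so any timing error is confined to that segment instead of spreading across the whole track.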

@jupenur Apologies for the late addition of a non-trivial dependency. Let me know if you have any questions.

Ticket Link

https://mattermost.atlassian.net/browse/MM-54242

cpoile

Awesome (as usual)! :)

cpoile · 2023-11-10T16:37:47Z

cmd/transcriber/call/tracks.go

+			startSampleOff := int(seg.SpeechStartAt * trackOutAudioRate)
+			endSampleOff := int(seg.SpeechEndAt * trackOutAudioRate)


One request -- can you add some comments for future us about what the units are for seg.SpeechStartAt and startSampleOff? (That will help in the future when I'm reviewing and remembering what the numbers mean.)

Sure, but the sample offset is just what it says: an index to the first sample where the speech starts (± some padding). There's no unit really.
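To make the units in this exchange concrete, here is a small sketch of the conversion. It assumes the segment timestamps are in seconds and the audio rate is in samples per second, which this thread does not state explicitly. `sampleOffset` is a hypothetical helper and 16000 is only an example rate, not a value taken from the PR.

```go
package main

// sampleOffset converts a timestamp in seconds into an index into a slice of
// audio samples recorded at `rate` samples per second. The result is a plain
// sample count, clamped to [0, numSamples] so it is always safe to slice with.
// Like the int(...) conversion in the diff, it truncates toward zero.
func sampleOffset(sec float64, rate, numSamples int) int {
	off := int(sec * float64(rate)) // seconds * (samples / second) = samples
	if off < 0 {
		return 0
	}
	if off > numSamples {
		return numSamples
	}
	return off
}
```

For example, `sampleOffset(1.5, 16000, 48000)` gives 24000: speech starting 1.5 s into a 16 kHz track starts at sample index 24000. The clamping is an addition for illustration; the diff above performs the bare conversion.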

build/prepare_deps.sh

* Implement more human friendly filenames
* Include Silero VAD model v4
* Cache CGO dependencies and Whisper models in Docker build
* Update silero-vad-go
* Add SHA check for ONNXRuntime
* Support language autodetection
* Initial multi-threading support
* Fix marshaling case
* Tune speech detector silence duration threshold
* Sanitize Text and Speaker strings
* Build as position-independent executable
* Update rtcd client dependency
* Better escaping

streamer45 added 2 commits November 9, 2023 17:45

Implement speech detection step to improve transcription accuracy

1d62dbc

Update build files

cbacf7a

streamer45 added the 2: Dev Review (requires review by a core committer) and 3: Security Review labels Nov 10, 2023

streamer45 added this to the v0.1.0 milestone Nov 10, 2023

streamer45 requested review from cpoile and jupenur November 10, 2023 02:10

streamer45 self-assigned this Nov 10, 2023

cpoile approved these changes Nov 10, 2023

streamer45 removed the 2: Dev Review (requires review by a core committer) label Nov 10, 2023

Add comments

d0fd928

jupenur requested changes Nov 14, 2023

build/prepare_deps.sh (resolved)

streamer45 requested a review from jupenur November 15, 2023 17:22

jupenur approved these changes Nov 15, 2023

streamer45 added the 3: Reviews Complete (all reviewers have approved the pull request) label and removed the 3: Security Review label Nov 16, 2023

streamer45 merged commit 92909b8 into MM-53934 Nov 16, 2023
2 checks passed

streamer45 deleted the MM-54242 branch November 16, 2023 16:15

aiaimimi0920 mentioned this pull request Jan 16, 2024

Missing some text V-Sekai/godot-whisper#38

Closed
