attempt to fix the repetition/hallucination issue identified in #1046 #1052
Conversation
Hi @jongwook, not sure if you saw the comment below, but it includes a reproduction case which might be useful: the repetition persists with this PR.
@ryanheise thanks! will look into it...
The problem triggered by the test data from @ryanheise is model-sensitive. I see the problem with ryan-test-sub.mp4
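For reference, a minimal reproduction sketch based on the clips shared in this thread. The model size is an assumption (the problem appears to be model-sensitive), and ffmpeg must be available so whisper can decode the audio track of the .mp4:

```python
import whisper

# Assumed model size; try others, since the issue is reported to be model-sensitive.
model = whisper.load_model("medium")
result = model.transcribe("ryan-test-sub.mp4", word_timestamps=True, verbose=True)

# Print segment-level timestamps so repetitions are easy to spot by eye.
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f} -> {segment['end']:7.2f}] {segment['text']}")
```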
I can confirm this fixed my example, thanks! 👍
@glangford FYI the subtitles didn't show in your video.
@ryanheise Inline (on Mac at least), you may need to click on the >> on the right to turn on subtitles. Or download and view with VLC, QuickTime, or whatever and enable subtitles in the viewer.
Ah, I see, Firefox doesn't show any options, but downloading it and opening in VLC works. You can also do hard subs this way #435 (reply in thread)
Here is the 69-whiskey-clip.mp4
Btw have you guys tried with longer audio, e.g. 5 mins long? I am still getting a lot of repetition even with this fix.
I was hoping to update the word segmentation results for whisper-only word timestamps in our paper https://arxiv.org/abs/2303.00747, but currently I am getting better results with our implementation, which is similar to https://github.com/linto-ai/whisper-timestamped
I am testing a longer audio file now (running on CPU, larger model, transcript+transcribe, so it is taking a while). For clarity, it seems like there are different possible sources of error across all the different discussions.
yes
yes
@jongwook Note from @m-bain's example above the repetition occurring with the verbose print. The repetitions in this example are all "instantaneous", e.g. same start and end time.
They are printed but then immediately cleared by this code, which looks like a bug unique to the verbose output (Line 345 in aac47c9).
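A hedged sketch (my own helper, not part of this PR) of how one could surface these "instantaneous" repetitions from a transcription result, assuming transcribe() was run with word_timestamps=True:

```python
def find_instantaneous_repetitions(result: dict) -> list:
    """Flag words whose duration is zero and whose text repeats the previous word."""
    flagged = []
    previous_word = None
    for segment in result["segments"]:
        for word in segment.get("words", []):
            zero_duration = word["end"] - word["start"] <= 0.0
            repeated = (
                previous_word is not None
                and word["word"].strip() == previous_word["word"].strip()
            )
            if zero_duration and repeated:
                flagged.append(word)
            previous_word = word
    return flagged
```

This only detects the symptom; the clearing behaviour around Line 345 would still need to be fixed in transcribe.py itself.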
@m-bain Given this, could you maybe rerun and see if the formal output formats are messed up or not, using …
This is not a verbose error, and the start and end times of the repetitions are not always instantaneous; see the output for the .srt file without verbose, cues 271–278.
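For checking files like this programmatically, here is a small sketch (again my own helper, not from whisper) that flags consecutive .srt cues with identical text, which covers both the instantaneous and the non-instantaneous repetitions. It assumes well-formed .srt output:

```python
def find_repeated_cues(srt_path: str) -> list:
    """Return (index, timing, text) for cues whose text repeats the previous cue."""
    with open(srt_path, encoding="utf-8") as f:
        # .srt cues are separated by blank lines: index, timing line, then text.
        blocks = [b.strip() for b in f.read().split("\n\n") if b.strip()]

    repeated = []
    previous_text = None
    for block in blocks:
        lines = block.splitlines()
        index, timing, text = lines[0], lines[1], " ".join(lines[2:]).strip()
        if text == previous_text:
            repeated.append((index, timing, text))
        previous_text = text
    return repeated
```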
So there are at least two problems, then.
Given how close the start/end times are, it feels like something related to Line 337 in aac47c9
@m-bain Do the same repetitions happen with …?
Update: I realise there is some specific underline formatting in the word_timestamps; I was able to get it working in the end. See here for a comparison of word-level timestamp accuracy. @jongwook, could you share the evaluation setup for long-form transcription WER? I am unable to reproduce the Whisper results; right now I report the vanilla setting -- greedy/beam5 decoding without the heuristic tricks
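For what it's worth, here is one way that "vanilla setting" could be scored. This is my reading of it (greedy decoding, no temperature fallback, no conditioning on previous text), and jiwer plus the reference transcripts are assumptions rather than anything this repo ships:

```python
import jiwer  # assumed external dependency for computing WER
import whisper
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()
model = whisper.load_model("large-v2")  # assumed model size


def long_form_wer(audio_path: str, reference_text: str) -> float:
    # Vanilla setting: greedy decoding only (pass beam_size=5 instead for the
    # beam-search variant), no temperature fallback, no conditioning on
    # previously decoded text.
    result = model.transcribe(
        audio_path,
        temperature=0.0,
        condition_on_previous_text=False,
    )
    return jiwer.wer(normalize(reference_text), normalize(result["text"]))
```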
…i#1046 (openai#1052)
* attempt to fix the repetition/hallucination issue identified in openai#1046
* zero-pad the audio instead of spectrogram
* formatting fix
* delete debug print