Strange insertion of words not resembling what I spoke, even after I stop speaking! #158
Comments
Hi. Sorry for the very late reply. I was vacationing ⛱️. The problem you are seeing is a "hallucination" in the Whisper model. According to some rumors, OpenAI trained its models on audio and transcripts extracted from YouTube, which is why this "Thank you for watching" appears. There really isn't much I can do to fix this. Large-v3 seems to be the most affected, so if you use it, I recommend switching to Large-v2.
Thanks a lot mkiol, and I didn't think it was a late reply. I really appreciate your response and hope you had a lovely holiday :) I suppose I thought that version 3 would be better than version 2. Would there be any downsides to switching to version 2? Just so I can consider the pros and cons? Thank you
Large-v2 has slightly worse accuracy. Here is a detailed error rate for particular languages. The lower the WER (word error rate), the better the accuracy. For some languages the difference between v3 and v2 is minimal. Personally I always use Large-v2 to avoid this "hallucination" bug.
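For readers unfamiliar with WER, here is a minimal sketch of how the metric is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not necessarily how the linked figures were produced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hallucinated trailing words count as insertions and raise the WER.
clean = word_error_rate("the cat sat", "the cat sat")
hallucinated = word_error_rate("the cat sat", "the cat sat thank you")
```

Note that inserted words are penalized just like misrecognized ones, so a transcript with hallucinated phrases can score a high WER even when every spoken word was recognized correctly.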
Thanks for that mkiol, that's very interesting. On reflection I think it may only add words when there is silence. It would be great if, in future, there were some way to stop it inserting words when there is silence. Or perhaps version 4 will fix this bug? I will try out v2 meanwhile, since the error rates for the English versions are not that different. It is interesting to see the error rate for different languages in the link you sent. I wonder if there could be a way users could train it when it produces errors, so that it learns? Or maybe that is too difficult and/or a can of worms...
You are absolutely right. This hallucination only occurs with non-speech audio data. The best way to avoid it is to remove the non-voice parts from the audio stream. This is already implemented in Speech Note as a VAD (voice activity detection) preprocessing step. It uses a very fast but not very accurate algorithm borrowed from the WebRTC project. I think this can be improved. Another approach might be to use a neural network trained specifically to detect silence, like Silero VAD.
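To illustrate the general idea of dropping non-voice audio before transcription, here is a minimal sketch using a plain energy threshold. This is not Speech Note's implementation, nor the WebRTC or Silero algorithm; the function names, the 30 ms frame size, and the `threshold` value are all assumptions chosen for the example, and real VADs are considerably more robust to background noise.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def drop_silence(samples, sample_rate=16000, frame_ms=30, threshold=0.01):
    """Keep only the frames whose RMS energy exceeds `threshold`.

    Hypothetical helper: `samples` is a list of floats in [-1.0, 1.0].
    The threshold is an assumed value and would need tuning per microphone.
    """
    frame_len = sample_rate * frame_ms // 1000
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame and frame_rms(frame) > threshold:
            voiced.extend(frame)
    return voiced


# One second of silence, one second of a 220 Hz tone, one second of silence.
silence = [0.0] * 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]
kept = drop_silence(silence + tone + silence)
```

Only the frames overlapping the tone survive, so the recognizer would never see the long silent stretches where the hallucinated phrases tend to appear. A fixed energy threshold mistakes loud non-speech noise for voice, which is one reason a trained detector can do better.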
Interesting idea, but quite difficult to implement. Speech Note uses only static models, and using these models doesn't change them. Training is currently out of scope.
Thank you mkiol. Actually, now that I think about it, the hallucination only happens when I am speaking live and pausing in between sentences. If I run an mp3 recording of something through it, it doesn't seem to do this. Thank you for advising that training is currently out of scope. I had seen that some languages were more accurate than others and wondered whether there might be a way to increase the accuracy of those with a lower accuracy level. Fair enough!
Hi mkiol, unfortunately I'm having the same issue with words I didn't speak being inserted with Whisper Large v2 :( (both Large-v2 and Distil Large-v2) :(
Thank you very much for this wonderful program, it has very high accuracy levels and is helping me so much in many ways :)
But unfortunately Speech Note keeps inserting words that I didn't speak randomly into the text. Usually it's "thank you" when I didn't say "thank you" and often other words too that bear no resemblance to anything I spoke. This also happens after I stop speaking. I left it running just now while there was no sound and it randomly inserted this text during the silence:
"Good so much. It's all right. Be it. Thank you. Oh, oh. So, okay. Okay. Okay. Oh, so, woman! Five! Thanks. Thank you."
Often it will insert "thank you" more than 5 times after I stop speaking. And it adds other random strange phrases. "Thank you for watching" is another frequent phrase. One time it added "God bless you!" which I certainly did not say!
Perhaps this is a bug? I wanted to let you know.
Thanks!