Strange insertion of words not resembling what I spoke, even after I stop speaking! #158

Open
Getarhubar opened this issue Sep 1, 2024 · 7 comments

@Getarhubar

Thank you very much for this wonderful program, it has very high accuracy levels and is helping me so much in many ways :)

But unfortunately Speech Note keeps randomly inserting words that I didn't speak into the text. Usually it's "thank you" when I didn't say "thank you", and often other words too that bear no resemblance to anything I spoke. This also happens after I stop speaking. I left it running just now while there was no sound and it inserted this text during the silence:

"Good so much. It's all right. Be it. Thank you. Oh, oh. So, okay. Okay. Okay. Oh, so, woman! Five! Thanks. Thank you."

Often it will insert "thank you" more than 5 times after I stop speaking. And it adds other random strange phrases. "Thank you for watching" is another frequent phrase. One time it added "God bless you!" which I certainly did not say!

Perhaps this is some bug? I wanted to let you know

Thanks!

@mkiol
Owner

mkiol commented Sep 14, 2024

Hi. Sorry for the very late reply. I was on vacation ⛱️.

The problem you are seeing is a "hallucination" in the Whisper model. According to some rumors, OpenAI trained its models on audio and transcripts extracted from YouTube, which is why this "Thank you for watching" appears. There really isn't much I can do to fix this. Large-v3 seems to be the most affected, so if you use it, I recommend switching to Large-v2.

@Getarhubar
Author

Thanks a lot mkiol and I didn't think it was a late reply, I really appreciate your response and hope you had a lovely holiday :) I suppose I thought that version 3 would be better than version 2. Would there be any downsides to switching to version 2? Just so I can consider the pros and cons?

Thank you

@mkiol
Owner

mkiol commented Sep 15, 2024

> Would there be any downsides to switching to version 2? Just so I can consider the pros and cons?

Large-v2 has slightly worse accuracy. Here is a detailed error rate for particular languages. The lower the WER (word error rate), the better the accuracy. For some languages the difference between v3 and v2 is minimal.
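For anyone unfamiliar with the metric: WER is usually computed as the word-level edit distance (substitutions, deletions and insertions) between the reference text and the transcript, divided by the number of words in the reference. The snippet below is only a minimal sketch of that standard definition; the function name `word_error_rate` is made up for illustration, and it assumes plain whitespace tokenization with no punctuation handling. It is not how the linked error rates were produced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); assumes whitespace tokenization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Two hallucinated words appended to a four-word sentence count as
# two insertions: 2 / 4 = 0.5
print(word_error_rate("please open the window",
                      "please open the window thank you"))
```

Note that inserted words count as errors, so hallucinated phrases like "thank you" raise the WER even when every spoken word was transcribed correctly.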

Personally I always use Large-v2 to avoid this "hallucination" bug.

@Getarhubar
Author

Getarhubar commented Sep 16, 2024

Thanks for that mkiol, that's very interesting. On reflection I think it may only add words when there is silence. It would be great if in future, there were some way to stop it inserting words when there is silence. Or perhaps version 4 will fix this bug? I will try out v2 meanwhile, since the error rates for the English versions are not that different.

It is interesting to see the error rate for different languages in the link you sent. I wonder if there could be a way users could train it when it produces errors, so that it learns? Or maybe that is too difficult and/or a can of worms...

@mkiol
Owner

mkiol commented Sep 16, 2024

> It would be great if in future, there were some way to stop it inserting words when there is silence

You are absolutely right. This hallucination only occurs with non-speech audio data. The best way to avoid it is to remove the non-voice parts from the audio stream. This is already implemented in Speech Note as a VAD (voice activity detection) preprocessing step, using a very fast but not very accurate algorithm borrowed from the WebRTC project. I think this can be improved. Another approach might be to use a specially trained neural network, such as Silero VAD, to detect silence.
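To illustrate the general idea of dropping non-voice audio before it reaches the model, here is a rough sketch. It is not the WebRTC algorithm, not Silero VAD, and not Speech Note's actual code: it swaps in a plain energy threshold, which is about the simplest possible stand-in. The names `frame_rms` and `drop_silent_frames`, the 30 ms frame size, and the threshold of 500 are all assumptions chosen for illustration, and the input is assumed to be 16-bit mono PCM samples.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def drop_silent_frames(samples: Sequence[int],
                       sample_rate: int = 16000,
                       frame_ms: int = 30,
                       rms_threshold: float = 500.0) -> List[int]:
    """Keep only frames whose RMS energy reaches the threshold.

    Hypothetical helper: a crude energy gate, assuming 16-bit mono PCM.
    Real VADs (WebRTC, Silero) use far more robust speech detection.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    voiced: List[int] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame and frame_rms(frame) >= rms_threshold:
            voiced.extend(frame)  # loud enough: pass it on to the model
        # otherwise the frame is treated as silence and dropped
    return voiced


# One silent frame, one loud frame, one silent frame (480 samples each
# at 16 kHz and 30 ms): only the loud frame survives.
silence = [0] * 480
loud = [8000, -8000] * 240
kept = drop_silent_frames(silence + loud + silence)
print(len(kept))
```

A fixed energy threshold like this will misfire on background noise or quiet speech, which is the kind of inaccuracy that makes a trained detector attractive.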

> I wonder if there could be a way users could train it when it produces errors, so that it learns?

Interesting idea, but quite difficult to implement. Speech Note uses only static models, and using these models doesn't change them. Training is out of scope right now.

@Getarhubar
Author

Thank you mkiol, actually now that I think about it, the hallucination only happens when I am speaking live and pausing in between sentences. If I run an mp3 recording of something through it, it doesn't seem to do this.

Thank you for advising that training is currently out of scope. I had seen that some languages were more accurate than others and wondered whether there might be a way to increase the accuracy of those with a lower accuracy level. Fair enough!

@Getarhubar
Author

Hi mkiol unfortunately I'm having the same issue with words I didn't speak being inserted with Whisper Large v2 :( (both Large-v2 and Distil Large-v2) :(
