Strange insertion of words not resembling what I spoke, even after I stop speaking! #158

Getarhubar · 2024-09-01T09:12:53Z

Thank you very much for this wonderful program, it has very high accuracy levels and is helping me so much in many ways :)

But unfortunately Speech Note keeps inserting words that I didn't speak randomly into the text. Usually it's "thank you" when I didn't say "thank you" and often other words too that bear no resemblance to anything I spoke. This also happens after I stop speaking. I left it running just now while there was no sound and it randomly inserted this text during the silence:

"Good so much. It's all right. Be it. Thank you. Oh, oh. So, okay. Okay. Okay. Oh, so, woman! Five! Thanks. Thank you."

Often it will insert "thank you" more than 5 times after I stop speaking. And it adds other random strange phrases. "Thank you for watching" is another frequent phrase. One time it added "God bless you!" which I certainly did not say!

Perhaps this is some bug? I wanted to let you know

Thanks!

mkiol · 2024-09-14T16:26:55Z

Hi. Sorry for the very late reply. I was on vacation ⛱️.

The problem you are seeing is a "hallucination" in the Whisper model. According to some rumors, OpenAI trained its models on audio and transcripts extracted from YouTube, which is why this "Thank you for watching" appears. There really isn't much I can do to fix this. Large-v3 seems to be the most affected, so if you use it, I recommend switching to Large-v2.

Getarhubar · 2024-09-15T09:33:01Z

Thanks a lot mkiol and I didn't think it was a late reply, I really appreciate your response and hope you had a lovely holiday :) I suppose I thought that version 3 would be better than version 2. Would there be any downsides to switching to version 2? Just so I can consider the pros and cons?

Thank you

mkiol · 2024-09-15T18:19:16Z

> Would there be any downsides to switching to version 2? Just so I can consider the pros and cons?

Large-v2 has slightly worse accuracy. Here is a detailed error rate for particular languages. The lower the WER (word error rate), the better the accuracy. For some languages the difference between v3 and v2 is minimal.
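For anyone unfamiliar with the metric: WER is the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is only an illustration of that definition, not code from Speech Note or Whisper; the function name `word_error_rate` and the sample sentences are made up for the example, and real benchmarks usually normalize punctuation and casing more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; splits on whitespace and
    lowercases, with no other text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A hallucinated "thank you" appended to an otherwise perfect transcript
# counts as 2 insertions against 6 reference words.
spoken = "I will try the older model"
transcribed = "I will try the older model thank you"
print(word_error_rate(spoken, transcribed))
```

Note that inserted words count against the score just like misheard ones, so hallucinated phrases during silence raise the WER even when everything actually spoken was transcribed correctly.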

Personally I always use Large-v2 to avoid this "hallucination" bug.

Getarhubar · 2024-09-16T04:49:58Z

Thanks for that mkiol, that's very interesting. On reflection I think it may only add words when there is silence. It would be great if in future, there were some way to stop it inserting words when there is silence. Or perhaps version 4 will fix this bug? I will try out v2 meanwhile since the error rates for the English versions are not that different.

It is interesting to see the error rate for different languages in the link you sent. I wonder if there could be a way users could train it when it produces errors, so that it learns? Or maybe that is too difficult and/or a can of worms...

mkiol · 2024-09-16T18:09:02Z

> It would be great if in future, there were some way to stop it inserting words when there is silence

You are absolutely right. This hallucination only occurs with non-speech audio data. The best way to avoid it is to remove the non-voice part from the audio stream. This is already implemented in Speech Note as a VAD (voice activity detection) preprocessing step. It uses a very fast but not very accurate algorithm borrowed from the WebRTC project. I think this can be improved. Another approach might be to use a neural network trained specifically to detect silence, such as Silero VAD.
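To make the idea of dropping non-voice audio before transcription concrete, here is a rough sketch. It is not the WebRTC algorithm, not Silero VAD, and not how Speech Note does it: it is a naive energy gate that keeps a frame only if its RMS level reaches a threshold. The names `frame_rms` and `drop_silent_frames`, the 30 ms frame size at 16 kHz, and the threshold of 500 are all assumptions picked for the example.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def drop_silent_frames(samples, frame_len=480, threshold=500.0):
    """Keep only frames whose RMS level reaches the threshold.

    frame_len=480 corresponds to 30 ms at 16 kHz (assumed sample rate).
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= threshold:
            kept.extend(frame)
    return kept


# Synthetic input: silence, a 440 Hz tone standing in for speech, silence.
sample_rate = 16000
silence = [0] * 480
tone = [int(8000 * math.sin(2 * math.pi * 440 * n / sample_rate)) for n in range(480)]

kept = drop_silent_frames(silence + tone + silence)
print(len(kept))  # 480: only the tone frame survives
```

A gate like this cannot tell speech from any other loud sound, and quiet background noise just above the threshold still gets through. That gap is the reason for considering a trained detector instead.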

> I wonder if there could be a way users could train it when it produces errors, so that it learns?

Interesting idea, but quite difficult to implement. Speech Note only uses static models, and using them doesn't change them. Training is out of scope right now.

Getarhubar · 2024-09-18T09:40:11Z

Thank you mkiol, actually now that I think about it, the hallucination only happens when I am speaking live and pausing in between sentences. If I run an mp3 recording of something through it, it doesn't seem to do this.

Thank you for advising that training is currently out of scope. I had seen that some languages were more accurate than others and wondered whether there might be a way to increase the accuracy of those with a lower accuracy level. Fair enough!

Getarhubar · 2024-11-07T09:25:39Z

Hi mkiol unfortunately I'm having the same issue with words I didn't speak being inserted with Whisper Large v2 :( (both Large-v2 and Distil Large-v2) :(
