option --suppress_token to reduce hallucinations / output special noise descriptions #105
First, here is a small piece of code to see what this token means:

```python
import whisper

tokenizer = whisper.tokenizer.get_tokenizer(True, task="transcribe", language="en")
tokenizer.decode_with_timestamps([50364])
# Out[3]: '<|0.00|>'
```

So this token is the first timestamp token, meaning we are at time "0.00" and we want the Whisper decoder model to predict timestamps at the end of each segment. Now I don't understand what you mean by "suppress this token": how do you do this? There is a mode in which the Whisper model can predict the transcription without predicting timestamps.
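For reference, that no-timestamps mode can be requested directly from the Python API. A minimal sketch, assuming the standard openai-whisper package (the model size and audio file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")

# without_timestamps=True makes the decoder predict plain text only,
# so no <|x.xx|> timestamp tokens are generated during decoding.
result = model.transcribe("clip.wav", without_timestamps=True)
print(result["text"])
```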
Thanks! OK, that's interesting. Actually, the only thing I did was to add the parameter --suppress_tokens 50364. Here's the whole invocation (for a single sample): …
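With the standard openai-whisper CLI (or a wrapper exposing the same flags), such an invocation could look roughly like the sketch below; the file and model names are placeholders, not the exact command used in the thread. --suppress_tokens takes a comma-separated list of token ids, and keeping -1 in the list preserves the default suppression set:

```bash
whisper sample.wav --model medium --language en --suppress_tokens "-1,50364"
```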
Thanks a lot @misutoneko for the clarification. Okay so it's getting really interesting.
It seems that the first point has no influence on hallucinations on silences, but that the second one has. Also, this gave me an idea of making Whisper decode in the "without timestamps" mode …
Very nice, I knew you'd make sense of it :D
Yes, exactly. On my side, I played a bit with … but I'm happy for you if you have a great experience with this.
Hi again,
I've now (finally) taken a peek at .words.json files, and it immediately paid off :D
I noticed that with the medium model (with --language en), the first token is always 50364.
It's some kind of special token I guess, but I couldn't find any direct references, nor do I have any idea where it comes from.
Long story short, if I suppress this token, it totally eradicates any hallucinations related to non-speech clips.
The clip will get a reasonable description of any noise or music instead => yay :D
So, is there a reason for this token to exist? Perhaps it should be suppressed by default.
I haven't noticed any downsides to suppressing it, but I guess it's possible that some utterances might go undetected if they genuinely contain this token.
EDIT:
It seems this token is the same in all the multilingual models.
For English-only models the token is 50363 (I didn't test the large models, though; they're probably the same).
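These two ids correspond to the tokenizer's timestamp_begin value (the id of <|0.00|>), so the token to suppress can be looked up rather than hard-coded. A minimal sketch, assuming the openai-whisper Python API (the model size and file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")  # or "medium.en", "large", ...
tokenizer = whisper.tokenizer.get_tokenizer(
    model.is_multilingual, task="transcribe", language="en"
)

# <|0.00|> is the first timestamp token: 50364 for multilingual models,
# 50363 for the English-only ones.
print(tokenizer.timestamp_begin)

# Add it to the default suppression list ("-1") when transcribing:
result = model.transcribe(
    "clip.wav", suppress_tokens=f"-1,{tokenizer.timestamp_begin}"
)
```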