option --suppress_token to reduce hallucinations / output special noise descriptions #105
First, here is a small piece of code to see what this token means:

```python
import whisper

tokenizer = whisper.tokenizer.get_tokenizer(True, task="transcribe", language="en")
tokenizer.decode_with_timestamps([50364])
# Out[3]: '<|0.00|>'
```

So this token is the first timestamp token, meaning we are at time "0.00" and we want the Whisper decoder model to predict timestamps at the end of each segment. Now I don't understand what you mean by "suppress this token": how do you do this? There is a mode in which the Whisper model can predict the transcription without predicting timestamps.
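For reference, that no-timestamps mode can be requested directly from the Python API. A minimal sketch, assuming the standard openai-whisper package (the model size and audio file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")

# without_timestamps=True makes the decoder predict plain text only,
# so no <|x.xx|> timestamp tokens are generated during decoding.
result = model.transcribe("clip.wav", without_timestamps=True)
print(result["text"])
```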
Thanks! OK, that's interesting. Actually, the only thing I did was to add the parameter --suppress_tokens 50364. Here's the whole invocation (for a single sample): …
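With the standard openai-whisper CLI (or a wrapper exposing the same flags), such an invocation could look roughly like the sketch below; the file and model names are placeholders, not the exact command used in the thread. --suppress_tokens takes a comma-separated list of token ids, and keeping -1 in the list preserves the default suppression set:

```bash
whisper sample.wav --model medium --language en --suppress_tokens "-1,50364"
```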
Thanks a lot @misutoneko for the clarification. Okay so it's getting really interesting.
It seems that the first point has no influence on hallucinations on silences, but that the second one has. Also, this gave me an idea of making Whisper decode in the "without timestamps" mode …
Very nice, I knew you'd make sense of it :D
Yes, exactly. On my side, I played a bit with … but I'm happy for you if you have a great experience with this.
Hi again,
I've now (finally) taken a peek at .words.json files, and it immediately paid off :D
I noticed that with the medium model (with --language en), the first token is always 50364.
It's some kind of special token I guess, but I couldn't find any direct references, nor do I have any idea where it comes from.
Long story short, if I suppress this token, it totally eradicates any hallucinations related to non-speech clips.
The clip will get a reasonable description of any noise or music instead => yay :D
So, is there a reason for this token to exist? Perhaps it should be suppressed by default.
I haven't noticed any downsides to suppressing it, but I guess it's possible that some utterances might go undetected if they genuinely contain this token.
EDIT:
It seems this token is the same in all the multilingual models.
For English-only models the token is 50363 (I didn't test the large models, though; they're probably the same).
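These two ids correspond to the tokenizer's timestamp_begin value (the id of <|0.00|>), so the token to suppress can be looked up rather than hard-coded. A minimal sketch, assuming the openai-whisper Python API (the model size and file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")  # or "medium.en", "large", ...
tokenizer = whisper.tokenizer.get_tokenizer(
    model.is_multilingual, task="transcribe", language="en"
)

# <|0.00|> is the first timestamp token: 50364 for multilingual models,
# 50363 for the English-only ones.
print(tokenizer.timestamp_begin)

# Add it to the default suppression list ("-1") when transcribing:
result = model.transcribe(
    "clip.wav", suppress_tokens=f"-1,{tokenizer.timestamp_begin}"
)
```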