Automatically adds "Thank you" #1592

gkarmas · 2023-12-04T18:01:39Z

When testing the large-v3 model with word-by-word transcript output, it always adds "Thank you" whenever there is no audio at the end.

bobqianic · 2023-12-05T01:31:39Z

That's hallucination.

https://arxiv.org/abs/2311.14648

gkarmas · 2023-12-05T01:58:21Z

Interesting, thanks for sharing. Is this fixable in the model? I'm stripping it programmatically for now.
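A minimal sketch of what stripping the phrase programmatically might look like. The helper name strip_trailing_phrase is hypothetical and is not part of whisper.cpp; it only post-processes the final transcript string, ignoring case, trailing whitespace, and trailing '.' or '!'.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of whisper.cpp): remove a trailing phrase such
// as "Thank you" from a transcript. The comparison ignores case and skips
// trailing whitespace and trailing '.' / '!' characters.
static std::string strip_trailing_phrase(std::string text, const std::string & phrase) {
    // Find the end of the text, ignoring trailing whitespace and punctuation.
    std::size_t end = text.size();
    while (end > 0 && (std::isspace((unsigned char) text[end - 1]) ||
                       text[end - 1] == '.' || text[end - 1] == '!')) {
        --end;
    }
    if (phrase.empty() || end < phrase.size()) {
        return text;
    }

    const std::size_t start = end - phrase.size();

    // Case-insensitive comparison of the tail of the text against the phrase.
    const bool match = std::equal(phrase.begin(), phrase.end(), text.begin() + start,
        [](char a, char b) {
            return std::tolower((unsigned char) a) == std::tolower((unsigned char) b);
        });

    // Only strip when the phrase starts at a word boundary.
    if (!match || (start > 0 && !std::isspace((unsigned char) text[start - 1]))) {
        return text;
    }

    text.erase(start);

    // Drop the whitespace that preceded the removed phrase.
    while (!text.empty() && std::isspace((unsigned char) text.back())) {
        text.pop_back();
    }
    return text;
}
```

Note that this also removes a genuine "Thank you" spoken at the end of a recording, so it is a stopgap rather than a fix.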

shylock74 · 2023-12-05T17:34:36Z

openai/whisper#928

misutoneko · 2023-12-05T22:18:01Z

As I've mentioned in that openai whisper thread, I got rid of these with the --suppress_tokens command line switch.
Looks like whisper.cpp doesn't have that, but the BEG token can be suppressed in whisper_process_logits(),
just add this line:
logits[vocab.token_beg] = -INFINITY;

It will cause you to get descriptions of sound events instead of "Thank you".
That's a bit easier to deal with, I think.

EDIT: Around line 4600 or so (there are similar lines for other tokens there).
EDIT2: Note that this doesn't work the same way in whisper.cpp; we need something else. No-timestamps mode was not a problem for me, but I guess that's only because my clips are usually very short.
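The token-suppression idea above can be illustrated in isolation. This is a standalone sketch, not whisper.cpp's actual whisper_process_logits(): the functions suppress_token and argmax are hypothetical, and the logits are a plain vector. It only shows why setting a logit to -INFINITY keeps a greedy decoder from ever picking that token.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one decoder step; not whisper.cpp's real code.
// Forcing a logit to -inf gives the token zero probability after softmax,
// so neither greedy nor sampling-based decoding can select it.
static void suppress_token(std::vector<float> & logits, int token_id) {
    if (token_id >= 0 && (std::size_t) token_id < logits.size()) {
        logits[token_id] = -INFINITY;
    }
}

// Greedy pick: index of the largest logit (returns 0 for an empty vector).
static int argmax(const std::vector<float> & logits) {
    int best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = (int) i;
        }
    }
    return best;
}
```

With logits {0.1, 2.5, 1.0}, the greedy pick is index 1; after suppress_token(logits, 1) it becomes index 2. As EDIT2 notes, whether this actually helps in whisper.cpp is a separate question.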

JRWSP · 2023-12-05T23:15:39Z

> As I've mentioned in that openai whisper thread, I got rid of these with the --suppress_tokens command line switch. Looks like whisper.cpp doesn't have that, but the BEG token can be suppressed in whisper_process_logits(), just add this line: logits[vocab.token_beg] = -INFINITY;
>
> It will cause you to get descriptions of sound events instead of "Thank you". That's a bit easier to deal with, I think.

Can you give more details on where to add the line? I don't know C++.

bobqianic · 2023-12-05T23:35:31Z

@JRWSP Add it to this function.

whisper.cpp/whisper.cpp

Line 4535 in 3163090

static void whisper_process_logits(

bobqianic · 2023-12-05T23:39:03Z

> As I've mentioned in that openai whisper thread, I got rid of these with the --suppress_tokens command line switch. Looks like whisper.cpp doesn't have that, but the BEG token can be suppressed in whisper_process_logits(), just add this line: logits[vocab.token_beg] = -INFINITY;
>
> It will cause you to get descriptions of sound events instead of "Thank you". That's a bit easier to deal with, I think.

Could you explain how removing the BEG token (begin time stamps) helps in reducing hallucinations?

misutoneko · 2023-12-06T10:33:13Z

Well, if I've understood this correctly, suppressing the non-speech tokens somehow causes the BEG token to emerge (rather than the NOT/no-timestamps token), and that's what causes these hallucinations.
(I don't think I saw any of these problematic tokens coming up after the NOT token.)
So in that sense, simply suppressing BEG might help, or it might not.
But the NOT token (no-timestamps mode) might not be very desirable either.

The workaround that I used for whisper/whisper-timestamped was to allow non-speech tokens.
Here's the original thread:
linto-ai/whisper-timestamped#107

I suppose this could all be fixed in the training data too, but that's something we plebs don't get to see.
Btw I have tested this mostly with medium and small models (haven't tried large-v3). The "en" models use a different token id.

EDIT: OK it was a nice theory, but it doesn't hold up (for whisper.cpp).
whisper.cpp does have a parameter for non_speech_tokens, and they're allowed by default.
So there must be something else going on.
I actually tried to replicate the --suppress "" mode with whisper.cpp (by allowing everything through without filtering),
but it didn't seem to help much. Maybe there's just a difference between these two codebases in how the calculations are done.

PR1588 has some samples for testing.

misutoneko · 2023-12-10T13:39:24Z

Hmmm, do these hallucinated tokens always have low probability?
Because if so, they could be easily filtered out based on that.
But there is a risk that some useful tokens might get lost (low-quality audio?).
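If per-token probabilities are available, the filtering idea could be sketched roughly as follows. The TokenInfo struct, the drop_low_probability function, and the threshold are all hypothetical and not taken from whisper.cpp; whether hallucinated tokens really have low probability is exactly the open question above.

```cpp
#include <string>
#include <vector>

// Hypothetical token record: decoded text plus the probability the decoder
// assigned to it. This is an illustration, not a whisper.cpp type.
struct TokenInfo {
    std::string text;
    float       p;
};

// Keep only tokens whose probability is at least min_p.
// A threshold that is too high will also drop real speech from noisy audio.
static std::vector<TokenInfo> drop_low_probability(const std::vector<TokenInfo> & tokens, float min_p) {
    std::vector<TokenInfo> kept;
    for (const auto & t : tokens) {
        if (t.p >= min_p) {
            kept.push_back(t);
        }
    }
    return kept;
}
```

The threshold would have to be tuned per model and per audio quality, which is the risk mentioned above.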

Another idea I haven't seen mentioned is that prompting can sometimes help (for short clips?).
Even if the prompt is just " " that will change the output.

bobqianic added the "question" label (Further information is requested) on Dec 5, 2023

bobqianic added the "enhancement" label (New feature or request) on Jan 15, 2024

bobqianic linked a pull request on Jan 15, 2024 that will close this issue: Fix the decoding issues #1768 (Open, 11 tasks)

aiaimimi0920 mentioned this issue on Jan 16, 2024: Do something more for the silence mic hallucinating part V-Sekai/godot-whisper#20 (Closed)

jensdraht1999 mentioned this issue on Sep 1, 2024: Latest 1.6.2 release substantial increase in hallucinations for large-v3 on CUDA #2191 (Open)
