Latest 1.6.2 release substantial increase in hallucinations for large-v3 on CUDA #2191

magnacartatron · 2024-05-29T03:33:10Z

I ran the samples folder side by side on both an M1 Ultra and a 4070 Ti Super; I've been keeping track since January 2024.

The latest release has substantial hallucinations. For the hp0.wav sample file it repeats the same line dozens of times.

This was not the case with previous releases. Flash attention also seems to be slower on Metal for large-v3.

Have we given up on large-v3? I know it is a flawed model, but for foreign languages it generally outperforms v2.

To test: build on Ubuntu using CUDA Toolkit 12.5, run "make samples", and check the output, especially on files like hp0 and diffusion2023-07-03.

WilliamTambellini · 2024-05-29T17:42:57Z

I can confirm this, @magnacartatron.
With 1.6.1:

$ ./main -m ggml-large-v3.bin samples/SonyGroupCorpEarningsQ32024_137031005111_English.wav -l english
...
main: processing 'samples/SonyGroupCorpEarningsQ32024_137031005111_English.wav' (67585536 samples, 4224.1 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = english, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:15.520]   We will now begin FI 2023 Q3, Consolidated Financial Results Announcement for Sony Group Corporation, IAM Okada, Corporate Communications Service Master of Ceremonies.
[00:00:15.520 --> 00:00:21.700]   The people on the stage are Mr. Hiroki Totoki, President, COO, and CFO.
[00:00:22.700 --> 00:00:33.300]   Ms. Naomi Matsuoka, Senior Vice President in Charge of Corporate Planning and Control, Lead of Group Diversity Equipment and Inclusion, Support for Financial Services and Entertainment Area.
[00:00:33.300 --> 00:00:39.160]   Mr. Sadahiko Hayakawa, Senior Vice President in Charge of Finance and IR.
[00:00:39.160 --> 00:00:48.480]   These three people will be explaining the FI 2023 Q3 results and fully forecast, followed by Q&A.
[00:00:48.480 --> 00:00:52.420]   A total of 70.
[00:00:52.420 --> 00:00:53.980]   30 minutes is allocated.
[00:00:53.980 --> 00:00:55.940]   Mr. Totoki, the floor is yours.
[00:00:55.940 --> 00:01:05.140]   Today, after Mr. Matsuoka and Mr. Hayakawa explain the contents shown here, I will summarize the entire earnings briefly.
[00:01:05.140 --> 00:01:07.340]   Mr. Hayakawa, please go ahead.
[00:01:07.340 --> 00:01:11.820]   Matsuoka and Hayakawa will explain.
[00:01:11.820 --> 00:01:21.860]   Consolidated sales for the quarter were 3,747.5 billion yen, a significant increase of 22%,
[00:01:21.860 --> 00:01:22.400]   compared to the previous quarter.
[00:01:22.400 --> 00:01:22.400]   This is a significant increase of 22% compared to the previous quarter.
[00:01:22.400 --> 00:01:22.400]   This is a significant increase of 22% compared to the previous quarter.
[00:01:22.400 --> 00:01:22.400]   This is a significant increase of 22% compared to the previous quarter.

But whisper 1.5.2 outputs correctly.
@ggerganov that's a critical bug. Is there any way for someone from your team to work on it ASAP? Best.

WilliamTambellini · 2024-05-29T19:32:46Z

1.5.5 also hallucinates

WilliamTambellini · 2024-05-29T20:52:01Z

1.5.4 does NOT hallucinate.

ggerganov · 2024-05-30T06:53:07Z

There is mounting evidence that the large-v3 model is broken and repeats more than v2, so I recommend not using it. It's not a whisper.cpp problem; it's the model data itself.

magnacartatron · 2024-05-30T12:29:44Z

@ggerganov I agree with you, but I find it strange that this is happening with CUDA only and not Metal. On the exact same sample and the exact same version, running side by side, Metal produces no hallucinations while the CUDA build has them. This is for version 1.6.2. I've personally given up on large-v3 (it seems to run fine through Python), but I'm wondering if a bug was introduced in the newer versions. Additional details: the CUDA build was made with CUDA Toolkit 12.5 and current drivers (550). The Metal build was done on macOS 14.5 with the latest Xcode updates.

MathiasSchindler · 2024-05-30T12:39:43Z

This overlaps with the behaviour described in issue #1949 (also using CUDA, btw).

ggerganov · 2024-05-30T12:46:44Z

It's not impossible that there is a bug, but atm we don't have a good way to verify this. Ideally, we should have implemented a procedure where we measure the WER on a large set of audio and we would make sure after each change that the WER does not increase. But this is still not implemented
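As a rough illustration of the WER check described above, here is a minimal Python sketch. It is not part of whisper.cpp; the function names `word_edit_distance` and `corpus_wer` are hypothetical, and the normalization (lowercasing and whitespace splitting only) is an assumption that a real evaluation would likely need to refine.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) transcript pairs.

    Total word errors divided by total reference words, so long files
    weigh more than short ones.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return errors / ref_words


if __name__ == "__main__":
    # Toy pair for illustration only; a real check would use a large audio set.
    pairs = [
        (
            "compared to the previous quarter",
            "compared to the previous quarter this is a significant increase",
        ),
    ]
    print(f"WER: {corpus_wer(pairs):.2f}")
```

A regression check could then compare `corpus_wer` for a new build against a stored baseline and fail when it rises by more than some tolerance. Note that a repetition loop shows up as a run of insertions, so WER can exceed 1.0 on affected files.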

The only difference IIRC in v3 is that the vocab size is slightly larger. So I guess it could be possible that some of the CUDA kernels do not handle that specific kernel size well.

But I'm leaning more towards the possibility that v3 is simply broken and no conclusions can be made based on the results with this model, even if it appeared to work better in the past.

WilliamTambellini · 2024-05-30T17:56:34Z

Neither https://github.com/openai/whisper nor Hugging Face whisper hallucinates using the same model (v3 large) and the same input.
So likely a bug in the library?
In any case, c7dc37f also hallucinates.
Backtracking ...

WilliamTambellini · 2024-05-30T18:38:14Z

8e409d1 does not hallucinate.

WilliamTambellini · 2024-05-30T18:44:44Z

OK, the hallucinations started with 2852e1a from @josharian.

josharian · 2024-05-30T18:57:31Z

I find it hard to see how those could be related. Not saying you're wrong, just that it is prima facie pretty implausible. If you revert that change at head, does the behavior you're seeing change?

ggerganov · 2024-05-30T19:26:44Z

@WilliamTambellini Please also check the examples from the original PR when we added v3 support: #1444

Back then, I tested the reference implementation provided by OpenAI with the samples that we have been using in whisper.cpp since the beginning of the project, and the results were suspicious: repetitions + invalid characters. If these are still present, then I don't think we can do anything to fix this on the whisper.cpp side.

WilliamTambellini · 2024-05-30T21:56:49Z

Thanks @ggerganov.
No hallucination with #1444 (basically 1.4.3): the output is the same as the original OpenAI whisper implementation, and the transcription is almost perfect.
No hallucination in whisper.cpp 1.5.4.
Hallucinations start with 1.5.5 and persist in later versions, making them unusable for production purposes.
Here is the file:
https://file.io/7UWYV9njmAn3
Hallucinations start after about 2 minutes of audio.

magnacartatron · 2024-05-31T01:33:26Z

@WilliamTambellini Even with the built-in samples, the file hp0.wav makes it immediately clear how the hallucinations and repetitions start. It's a small 4-minute audio file and quick to transcribe.
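To make comparisons like this less dependent on eyeballing the output, repetition loops can be flagged automatically. The following is a minimal Python sketch, assuming the timestamped segment format shown in the transcripts in this thread (`[start --> end]  text`); the name `find_repetition_runs` and the `min_run` threshold are hypothetical choices for illustration, not part of whisper.cpp.

```python
import re
from typing import Iterable, List, Tuple

# Matches both "[00:01:22.400 --> 00:01:22.400]   text" (whisper.cpp)
# and "[03:16.740 --> 03:17.740]  text" (OpenAI whisper CLI).
SEGMENT = re.compile(r"^\[(\d[\d:.]*) --> (\d[\d:.]*)\]\s*(.*)$")


def find_repetition_runs(
    lines: Iterable[str], min_run: int = 3
) -> List[Tuple[str, str, int]]:
    """Return (start_timestamp, text, run_length) for every run of at least
    `min_run` consecutive segments with identical text."""
    runs: List[Tuple[str, str, int]] = []
    prev_text = None
    run_start = ""
    run_len = 0
    for line in lines:
        match = SEGMENT.match(line.strip())
        if not match:
            continue  # skip log lines, warnings, blank lines
        start, _end, text = match.groups()
        text = text.strip()
        if text == prev_text:
            run_len += 1
        else:
            if prev_text is not None and run_len >= min_run:
                runs.append((run_start, prev_text, run_len))
            prev_text, run_start, run_len = text, start, 1
    # Flush the final run, since loops often continue to the end of the file.
    if prev_text is not None and run_len >= min_run:
        runs.append((run_start, prev_text, run_len))
    return runs


if __name__ == "__main__":
    import sys

    # Usage: pipe transcription output into this script.
    for start, text, count in find_repetition_runs(sys.stdin):
        print(f"{start}: {count}x {text!r}")
```

The timestamp of the first reported run gives a rough idea of where in the audio the loop begins, which makes it easier to compare builds or commits on the same file.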

ggerganov · 2024-05-31T09:40:01Z

> Neither openai/whisper nor Hugging Face whisper hallucinate using the same model (v3 large) and same input.

Not sure what you tested since you didn't provide any details, so I just spent the time and checked again, and the conclusion is exactly the same as I explained in #1444 and in this thread: the reference OpenAI implementation leads to repetitions using v3.

$ whisper ../whisper.cpp/samples/hp0.wav --model large
/opt/homebrew/lib/python3.11/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:11.840]  Henry F. Phillips, from Wikipedia, the free encyclopedia, at en.wikipedia.org
[00:11.840 --> 00:26.140]  Henry F. Phillips, 1890-1958
[00:26.140 --> 00:38.160]  A U.S. businessman from Portland, Oregon, has the honor of having the Phillips head screw and screwdriver named after him.
[00:39.280 --> 00:52.120]  The importance of the cross-head screw design lies in its self-centering property, useful on automated production lines that use powered screwdrivers.
[00:53.760 --> 00:56.120]  Phillips' major contribution was in...
[00:56.140 --> 01:04.640]  driving the cross-head concept forward, to the point where it was adopted by screwmakers and automobile companies.
[01:05.580 --> 01:10.380]  Although he received patents for the design in 1936,
[01:10.380 --> 01:17.720]  U.S. Patent No. 2,046,343
[01:17.720 --> 01:24.100]  U.S. Patents 2,046,837
[01:24.100 --> 01:25.380]  to 2,046,837
[01:26.140 --> 01:29.140]  to 2,046,847
[01:30.540 --> 01:37.120]  It was so widely copied, that by 1949 Phillips lost his patent.
[01:38.280 --> 01:43.680]  The American Screw Company was responsible for devising a means of manufacturing the screw,
[01:43.680 --> 01:49.280]  and successfully patented and licensed their method.
[01:51.040 --> 01:55.600]  Other screw makers of the 1930s dismissed the Phillips concept,
[01:56.140 --> 02:03.640]  since it calls for a relatively complex recessed socket shape in the head of the screw, as
[02:03.640 --> 02:09.600]  distinct from the simple milled slot of a slotted-type screw.
[02:09.600 --> 02:17.020]  The Phillips Screw Company and the American Screw Company went on to devise the posidrive
[02:17.020 --> 02:23.740]  screw, which differs from the Phillips in that it is designed to accommodate greater
[02:23.740 --> 02:27.300]  torque than the Phillips.
[02:27.300 --> 02:35.540]  An image accompanied this article, captioned, Phillips Screw Head.
[02:35.540 --> 02:43.800]  The following is an infobox which accompanies this article, infobox, part of the series
[02:43.800 --> 02:52.740]  on screw drive types, slotted, commonly, erroneously, flathead.
[02:52.740 --> 02:53.740]  Phillips.
[02:53.740 --> 02:54.740]  Crosshead.
[02:54.740 --> 02:55.740]  Posidrive.
[02:55.740 --> 02:56.740]  Superdrive.
[02:56.740 --> 02:57.740]  Torx.
[02:57.740 --> 02:58.740]  Hex.
[02:58.740 --> 02:59.740]  Allen.
[02:59.740 --> 03:00.740]  Robertson.
[03:00.740 --> 03:01.740]  Tri-wing.
[03:01.740 --> 03:02.740]  Torxset.
[03:02.740 --> 03:03.740]  Spannerhead.
[03:03.740 --> 03:04.740]  Triple square.
[03:04.740 --> 03:05.740]  XZN.
[03:05.740 --> 03:06.740]  Others are crosshead.
[03:06.740 --> 03:07.740]  Crosshead.
[03:07.740 --> 03:08.740]  Spannerhead.
[03:08.740 --> 03:09.740]  Crosshead.
[03:09.740 --> 03:10.740]  Crosshead.
[03:10.740 --> 03:11.740]  Torxset.
[03:11.740 --> 03:12.740]  Spannerhead.
[03:12.740 --> 03:13.740]  Torxset.
[03:13.740 --> 03:14.740]  Torxset.
[03:14.740 --> 03:15.740]  Spannerhead.
[03:15.740 --> 03:16.740]  Torxset.
[03:16.740 --> 03:17.740]  Torxset.
[03:17.740 --> 03:18.740]  Torxset.
[03:18.740 --> 03:19.740]  Torxset.
[03:19.740 --> 03:20.740]  Torxset.
[03:20.740 --> 03:21.740]  Torxset.
[03:21.740 --> 03:22.740]  Torxset.
[03:22.740 --> 03:23.740]  Torxset.
[03:23.740 --> 03:33.940]  polydrive, spline drive, double hex. Many images accompanied this info box. This
[03:33.940 --> 03:44.980]  page was last modified on the 9th of April 2008 at 1704. All text is available
[03:44.980 --> 03:51.240]  under the terms of the GNU free documentation license. See copyrights for
[03:51.240 --> 03:57.920]  details. Wikipedia is a registered trademark of the Wikimedia Foundation
[03:57.920 --> 04:08.200]  Incorporated, a US registered 501c3 tax-deductible nonprofit charity. This
[04:08.200 --> 04:14.100]  sound file and all text in the article are licensed under the GNU free
[04:14.100 --> 04:19.320]  documentation license available at
[04:19.320 --> 04:20.320]  www.gnu.org.
[04:21.240 --> 04:31.180]  www.gnu.org slash copyleft slash fdl dot html

Here is v2 for reference:

$ whisper ../whisper.cpp/samples/hp0.wav --model large-v2
/opt/homebrew/lib/python3.11/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:13.520]  Henry F. Phillips, from Wikipedia, The Free Encyclopedia, at en.wikipedia.org.
[00:13.520 --> 00:21.360]  Henry F. Phillips, from Wikipedia, The Free Encyclopedia.
[00:21.360 --> 00:33.640]  Henry F. Phillips, 1890-1958, a U.S. businessman from Portland, Oregon, has the honor of having
[00:33.640 --> 00:40.280]  the Phillips head screw and screwdriver named after him.
[00:40.280 --> 00:47.640]  The importance of the crosshead screw design lies in its self-centering property, useful
[00:47.640 --> 00:54.040]  on automated production lines that use powered screwdrivers.
[00:54.040 --> 01:00.280]  Phillips' major contribution was in driving the crosshead concept forward to the point
[01:00.280 --> 01:06.240]  where it was adopted by screwmakers and automobile companies.
[01:06.240 --> 01:19.520]  Although he received patents for the design in 1936, U.S. Patent No. 2,046,343, U.S.
[01:19.520 --> 01:35.800]  Patents 2,046,837 to 2,046,840, it was so widely copied that by 1949 Phillips lost his
[01:35.800 --> 01:38.280]  patent.
[01:38.280 --> 01:44.360]  The American Screw Company was responsible for devising a means of manufacturing the
[01:44.360 --> 01:51.200]  screw, and successfully patented and licensed their method.
[01:51.200 --> 01:57.720]  Other screwmakers of the 1930s dismissed the Phillips concept since it calls for a relatively
[01:57.720 --> 02:05.040]  complex recessed socket shape in the head of the screw, as distinct from the simple
[02:05.040 --> 02:09.600]  milled slot of a slotted-type screw.
[02:09.600 --> 02:17.040]  The Phillips Screw Company and the American Screw Company went on to devise the posidrive
[02:17.040 --> 02:23.840]  screw, which differs from the Phillips in that it is designed to accommodate greater
[02:23.840 --> 02:27.280]  torque than the Phillips.
[02:27.280 --> 02:35.520]  An image accompanied this article, captioned, Phillips Screw Head.
[02:35.520 --> 02:40.440]  The following is an infobox which accompanies this article.
[02:40.440 --> 02:46.880]  Infobox, part of the series on screw drive types.
[02:46.880 --> 02:53.800]  Slotted, commonly, erroneously, flathead.
[02:53.800 --> 02:58.000]  Phillips, crosshead.
[02:58.000 --> 03:02.000]  Posidrive, supadrive.
[03:02.000 --> 03:04.000]  Torx.
[03:04.480 --> 03:06.480]  Hex, Allen.
[03:06.480 --> 03:08.480]  Robertson.
[03:08.480 --> 03:10.480]  Tri-wing.
[03:10.480 --> 03:12.480]  Torxset.
[03:12.480 --> 03:14.480]  Spannerhead.
[03:14.480 --> 03:16.480]  Triple-square, XZN.
[03:16.480 --> 03:28.480]  Others, polydrive, splinedrive, double-hex.
[03:28.480 --> 03:32.480]  Many images accompanied this infobox.
[03:32.960 --> 03:42.960]  This page was last modified on the 9th of April, 2008, at 1704.
[03:42.960 --> 03:48.960]  All text is available under the terms of the GNU Free Documentation License.
[03:48.960 --> 03:52.960]  See copyrights for details.
[03:52.960 --> 04:00.960]  Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a U.S. registered
[04:01.440 --> 04:07.440]  501c3 tax-deductible non-profit charity.
[04:07.440 --> 04:15.440]  This sound file and all text in the article are licensed under the GNU Free Documentation License,
[04:15.920 --> 04:23.920]  available at www.gnu.org
[04:23.920 --> 04:31.920]  slash copyleft slash fdl dot html.

So using the same argument as you do, I can say that the OpenAI implementation has a bug. Or what is more likely - the v3 model is simply broken

WilliamTambellini · 2024-06-03T18:05:26Z

Thanks @ggerganov.
Good catch: indeed, the v2 model doesn't seem to hallucinate on my audio file.
That being said, the quality of v2 is much lower than v3, so using v2 is not a solution either.
The facts are still that:

- the original whisper project does not hallucinate on these examples using the v3 model
- whisper.cpp 1.5.4 (and earlier) does not hallucinate on this audio with the v3 model

Could it be a bug in the ggml export of v3?
Best

MathiasSchindler · 2024-07-12T15:29:09Z

I added in #1949 that increasing the beam size with -bs 8 makes the large amount of hallucinations and repetitions go away.

backmind · 2024-07-21T15:57:47Z

> I added in #1949 that increasing the beam size with -bs 8 makes the large amount of hallucinations and repetitions go away.

I tried with -bs 8 and, sadly, got no better result. Next, I'm trying large-v2; let's see.

WilliamTambellini · 2024-07-31T21:29:01Z

Still the same issue on my side too, whatever the number of beams.
That being said, the hallucinations don't show up at the beginning of the audio; they always appear after about 2 minutes of transcription.
Could it be a bug in window shifting?

jensdraht1999 · 2024-09-01T07:29:10Z

@ggerganov I can tell you that Large V3 is a problem even on another port of this project. It has been tried by some members, and all are of the opinion that Large V3 is worse.

Also here is a list of people complaining about the quality of Large V3:

#2017
#1825
#1592
#1507
#1497

ggerganov · 2024-09-01T07:49:03Z

@jensdraht1999 Thank you for confirming. My guess is that V3 relies heavily on VAD pre-processing in order to produce good results.

The3IC mentioned this issue Jun 3, 2024

Some strange behavior still with large-v3 intel/openvino-plugins-ai-audacity#206

Open

au-voltzzz mentioned this issue Jun 13, 2024

Degraded quality with timestamps disabled #2186

Open

mkiol mentioned this issue Sep 15, 2024

Strange insertion of words not resembling what I spoke, even after I stop speaking! mkiol/dsnote#158

Open

MathiasSchindler mentioned this issue Oct 1, 2024

whisper_full_with_state: failed to decode #2334

Open
