Zero-filled WAV give hallucination and wrong duration #1881

Open
ukolovda opened this issue Feb 20, 2024 · 3 comments

@ukolovda

I tried to process a WAV file whose data section is all zeroes. The file is 1.2 seconds long (attached).

whisper.cpp produces a hallucinated transcript and reports the wrong duration.

zeroes.zip
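For reference, an equivalent zero-filled file can be generated with a short script. This is a minimal sketch, not the attached file itself; the parameters (16 kHz, 16-bit mono) are inferred from the "19200 samples, 1.2 sec" line in the logs below, and `zeroes.wav` is just a stand-in output path:

```python
import wave

# Write a 1.2 s, 16 kHz, 16-bit mono WAV whose data chunk is all zeroes,
# matching the 19200 samples reported by whisper.cpp for the attachment.
SAMPLE_RATE = 16000
DURATION_S = 1.2
n_samples = int(SAMPLE_RATE * DURATION_S)  # 19200

with wave.open("zeroes.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)          # 16-bit PCM
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"\x00\x00" * n_samples)
```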

$ ./main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Продолжение следует...


whisper_print_timings:     load time =   685.11 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     4.86 ms
whisper_print_timings:   sample time =    24.48 ms /    79 runs (    0.31 ms per run)
whisper_print_timings:   encode time =   120.78 ms /     1 runs (  120.78 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   323.14 ms /    77 runs (    4.20 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1164.00 ms
$ ./main -m ./models/ggml-large-v2.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v2.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3093.99 MB
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   34.82 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:04.000]   Редактор субтитров А.Семкин Корректор А.Егорова


whisper_print_timings:     load time =  2376.23 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =     5.14 ms
whisper_print_timings:   sample time =    50.08 ms /   152 runs (    0.33 ms per run)
whisper_print_timings:   encode time =   238.64 ms /     1 runs (  238.64 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   821.07 ms /   148 runs (    5.55 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  3498.43 ms
$ ./main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav -ng
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_model_load:      CPU total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Субтитры создавал DimaTorzok


whisper_print_timings:     load time =   957.60 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.50 ms
whisper_print_timings:   sample time =    24.92 ms /    75 runs (    0.33 ms per run)
whisper_print_timings:   encode time =  4063.61 ms /     1 runs ( 4063.61 ms per run)
whisper_print_timings:   decode time =   565.81 ms /    10 runs (   56.58 ms per run)
whisper_print_timings:   batchd time =  1186.10 ms /    63 runs (   18.83 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6809.96 ms

I checked this on the latest master branch:

$   git describe --tags
v1.5.4-183-gb602819

I think this is a bug.

@ukolovda ukolovda changed the title Zero-padded WAV give hallucination and wrong duration Zero-filled WAV give hallucination and wrong duration Feb 20, 2024
@misutoneko

This seems to be language-dependent; I see a similar effect with -l fi and several other languages.
My understanding is that the problem originates in the training data, so in that sense it can only be worked around, not truly fixed.
The model doesn't give you a "Russian silence" token, because there was no such thing in the training data to begin with.
It can perhaps give you an English or Italian one, but it's a different set of tokens for each language.
Still, I suppose entropy or compression ratio should give a hint that this is a non-speech portion, even without involving the model?
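That entropy/compression idea can be sketched without the model at all. A toy heuristic, assuming 16-bit PCM input; `looks_like_silence` and both thresholds are invented for illustration and are not part of whisper.cpp:

```python
import math
import struct
import wave
import zlib

def looks_like_silence(path, rms_floor=1e-4, min_compression=0.01):
    """Heuristic pre-check: flag chunks that are almost certainly non-speech
    so they can be skipped before decoding. Thresholds are illustrative."""
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Normalized RMS energy: exactly 0.0 for an all-zero data chunk.
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1)) / 32768.0
    # All-zero (or highly repetitive) PCM compresses extremely well,
    # which is a cheap stand-in for a low-entropy check.
    ratio = len(zlib.compress(raw)) / max(len(raw), 1)
    return rms < rms_floor or ratio < min_compression
```

For the attached file both signals fire: the RMS is zero and zlib shrinks the data chunk to a tiny fraction of its size, so the chunk could be dropped before it ever reaches the decoder.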

Multilingual is a bit tricky anyway, because once you set the language you can't change it (as discussed in #1800).
So you can't really detect an "English silence" and then switch languages, unless you cut the sample into smaller pieces with VAD/demucs/whatever.
Btw, I've actually tried giving the model multiple language tokens to see what happens, but it didn't work very well.
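The splitting idea above can be sketched as a toy energy gate. This is not a real VAD; `split_on_silence`, the 30 ms frame size, and the RMS threshold are all illustrative, and a production pipeline would use a proper VAD model instead:

```python
import struct
import wave

def split_on_silence(path, frame_ms=30, rms_gate=0.01):
    """Toy energy-gate 'VAD': return (start, end) sample ranges that contain
    signal above the gate, so silent stretches are never sent to the model."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    frame = rate * frame_ms // 1000
    spans, start = [], None
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = (sum(s * s for s in chunk) / len(chunk)) ** 0.5 / 32768.0
        if rms >= rms_gate and start is None:
            start = i                      # speech-like frame: open a span
        elif rms < rms_gate and start is not None:
            spans.append((start, i))       # quiet frame: close the span
            start = None
    if start is not None:
        spans.append((start, len(samples)))
    return spans
```

For the all-zero attachment this returns an empty list, i.e. nothing would be transcribed, which sidesteps the per-language "silence token" problem entirely.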

@superchargez

> This seems to be dependent on the language, I see a similar effect with -l fi and several others. My understanding is that the problem originates from the training data so in that sense it can only be worked around, not really fixed. So the model doesn't give you a "russian silence" token, because there wasn't such thing in the training data to begin with. It can perhaps give you an english or italian one, however it's a different set of tokens for each language. But I suppose entropy or compression ratio should give a hint that this is a non-speech portion, even without involving the model?
>
> Multilingual is a bit tricky anyways, because once you set the language you can't change it (as discussed in #1800). So you can't really detect an "english silence" and then switch languages, unless you cut the sample into smaller pieces with VAD/demucs/whatever. Btw I've actually tried giving the model multiple language tokens to see what happens then, but it didn't work very well.

I reached the same conclusion about Urdu: the model is limited and not very good for low-resource languages, and it can't handle silence in Urdu. I also couldn't find any VAD model that handled Urdu non-speech well, so I'm stuck with a high WER.

@DenisBalan

I'm also getting weird sentences coming out of nowhere with Russian:
"Редактор субтитров А.Семкин Корректор А.Егорова" ("Subtitle editor A. Semkin, corrector A. Egorova")

I also found this list of known hallucinations:

https://gist.github.com/waveletdeboshir/8bf52f04bf78018194f25b2390c08309
