CLBlast fails on context lengths above 2048 after merging #4256 #4296

Closed
LostRuins opened this issue Dec 2, 2023 · 15 comments · Fixed by #4307
Labels
bug Something isn't working

Comments

@LostRuins
Collaborator

Since the commit that merged #4256, inference with CLBlast fails with a segfault on context sizes above 2k when all GPU layers are offloaded.

Command line:
C:\test\llama-b1601-bin-win-clblast-x64>main.exe -m E:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf -c 4096 -b 512 -n 32 -ngl 33 -f C:\test\test.txt

main: build = 1601 (5a7d312)
main: built with MSVC 19.37.32826.1 for x64
main: seed  = 1701534899
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 2060'
ggml_opencl: device FP16 support: false

Result:
Prompt processing starts, then segfaults around the 2k-token mark, before generation begins. It only appears to work if the prompt is short enough (fewer than 2k tokens).

@ggerganov
Owner

Does it work with this patch:

diff --git a/llama.cpp b/llama.cpp
index fd905ade..69c45c3f 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -3813,7 +3813,7 @@ static struct ggml_tensor * llm_build_kqv(
     struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
     cb(kq, "kq", il);
 
-    if (max_alibi_bias > 0.0f) {
+    if (true) {
         // temporary branch until we figure out how to handle ggml_alibi through ggml_add
         kq = ggml_scale(ctx, kq, kq_scale);
         cb(kq, "kq_scaled", il);

@LostRuins
Collaborator Author

Nope, unfortunately this did not fix the issue, it still segfaults around the same point.

@ggerganov
Owner

Hm, I don't see what could have affected the OpenCL backend in that change.
Any extra information that you can provide (e.g. stack trace)? Does it depend on the value of -ngl?

@LostRuins
Collaborator Author

LostRuins commented Dec 3, 2023

There's no stack trace. In fact, there is no printout whatsoever; the program simply halts. I tried again with 0 layers offloaded and it happens there too, crashing at the same place. CUDA is fine, however.

Here's a video of b1579 vs b1601, showing the differences. The video has been sped up by 2x, but you can rewind/pause it at any point to review.

demo.mp4

The test text file I used for input is the first 5 sections of the GPL license, which you can find here:
test.txt

I can repro this consistently; it crashes at the same place every time. Reducing the prompt to a shorter one allows it to work.
I am on Windows 10 with an RTX 2060.

@AlpinDale
Contributor

AlpinDale commented Dec 3, 2023

Can confirm this happens for me too. Same command and prompt as @LostRuins. Hardware is RTX 2070S and Intel i7-8700, and I'm using Linux 6.5.9. Happens with -ngl 0 and -ngl 99. The error I get is:

free(): invalid next size (normal)
zsh: IOT instruction (core dumped)

Different error followed by a segfault with -ngl 32 (7B GGUF model):

ggml_opencl: clSetKernelArg(*to_fp32_cl, 0, sizeof(cl_mem), &d_Q) error -38 at ggml-opencl.cpp:1733
(OpenCL error -38 is CL_INVALID_MEM_OBJECT.)

@ggerganov
Owner

And b1600 works?

@AlpinDale
Contributor

I tested more, and I get a core dump with lower -c values too (tried 2048 and 1600).
It's an IOT instruction core dump.

@LostRuins
Collaborator Author

LostRuins commented Dec 3, 2023

Reverting this specific commit, ggml : add ggml_soft_max_ext (#4256), seems to work.
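For anyone trying the same locally, a one-line sketch of that revert (commit hash as given in a later comment in this thread):

git revert ef47ec18da469423c276b683dd9b5741cee7023e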

@slaren
Collaborator

slaren commented Dec 3, 2023

The free error suggests that this is a memory corruption issue. The changes in #4256 are not likely to be related. Running this with an ASAN build (enable LLAMA_SANITIZE_ADDRESS and LLAMA_SANITIZE_UNDEFINED) may show the source of the issue.
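For reference, a minimal sketch of such a build with CMake (sanitizer flag names as above; the LLAMA_CLBLAST option is assumed here for the OpenCL backend, adjust to your setup):

cmake -B build -DLLAMA_CLBLAST=ON -DLLAMA_SANITIZE_ADDRESS=ON -DLLAMA_SANITIZE_UNDEFINED=ON
cmake --build build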

@ggerganov
Owner

I'm able to reproduce - looking into it

@AlpinDale
Contributor

AlpinDale commented Dec 3, 2023

I built with ASan; here's the error trace I get when running with this command:

./main -m ~/models/openhermes-2-mistral-7b.Q6_K.gguf -c 4096 -b 512 -n 32 -ngl 99 -f test.txt

Error:

Log start
main: build = 1604 (33e171d)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1701609933
ggml_opencl: clGetPlatformIDs(NPLAT, platform_ids, &n_platforms) error -1001 at ggml-opencl.cpp:965

=================================================================
==2203735==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 264 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5dee0  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205dee0)
    #2 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #3 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5b80a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205b80a)
    #2 0x7f1260a5cfaa  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205cfaa)
    #3 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #4 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5b8ae  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205b8ae)
    #2 0x7f1260a5cfaa  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205cfaa)
    #3 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #4 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260a5b936  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205b936)
    #2 0x7f1260a5cfaa  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205cfaa)
    #3 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #4 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e1359 in __interceptor_malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f1260b70de2  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170de2)
    #2 0x7f1260a576bf  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576bf)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e1359 in __interceptor_malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f1260b70de2  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170de2)
    #2 0x7f1260a576db  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576db)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Indirect leak of 320 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260b70df9  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170df9)
    #2 0x7f1260a576db  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576db)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

Indirect leak of 320 byte(s) in 1 object(s) allocated from:
    #0 0x7f12666e0cc1 in __interceptor_calloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1260b70df9  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x2170df9)
    #2 0x7f1260a576bf  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x20576bf)
    #3 0x7f1260a5d18f  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x205d18f)
    #4 0x7f12608e566a  (/home/alpindale/AI-Stuff/tools/llama.cpp/main+0x1ee566a)
    #5 0x7f1265c02fd4  (/opt/cuda/lib64/libOpenCL.so.1+0x2fd4) (BuildId: b3217362255db6f1188e7596454ffe8bc4606b53)

SUMMARY: AddressSanitizer: 1136 byte(s) leaked in 8 allocation(s).

Reverted ef47ec18da469423c276b683dd9b5741cee7023e (#4256) and re-trying now.

@LostRuins added the bug (Something isn't working) label and removed the bug-unconfirmed label on Dec 3, 2023
@ggerganov
Owner

@AlpinDale When running with ASAN, you need to add this env variable: ASAN_OPTIONS=protect_shadow_gap=0 ./main .. to get past these bogus errors on init.
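For example, a full invocation might look like this (hypothetical model path; the other flags follow the reports above):

ASAN_OPTIONS=protect_shadow_gap=0 ./main -m models/model.gguf -c 4096 -b 512 -n 32 -ngl 99 -f test.txt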

Doing that, I now get the following sanitizer errors, confirming a bug in ggml.c that I introduced in #4256:

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1,100, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
generate: n_ctx = 4096, n_batch = 512, n_predict = 32, n_keep = 0


 GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright © 2007 Free Software Foundation, Inc. <https://fsf.org/>

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble
The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.

For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.

Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users.

Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free.

The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS
0. Definitions.
“This License” refers to version 3 of the GNU General Public License.

“Copyright” also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

“The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.

To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work.

A “covered work” means either the unmodified Program or a work based on the Program.

To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes
=================================================================
==364805==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62d000fc6580 at pc 0x5620bf18802b bp 0x7fe2cf3f2840 sp 0x7fe2cf3f2830
WRITE of size 4 at 0x62d000fc6580 thread T28
    #0 0x5620bf18802a in ggml_vec_cpy_f32 /home/ggerganov/development/github/llama.cpp/ggml.c:1158
    #1 0x5620bf22385d in ggml_compute_forward_soft_max_f32 /home/ggerganov/development/github/llama.cpp/ggml.c:10614
    #2 0x5620bf2244aa in ggml_compute_forward_soft_max /home/ggerganov/development/github/llama.cpp/ggml.c:10668
    #3 0x5620bf25fbbe in ggml_compute_forward /home/ggerganov/development/github/llama.cpp/ggml.c:13905
    #4 0x5620bf27e361 in ggml_graph_compute_thread /home/ggerganov/development/github/llama.cpp/ggml.c:15860
    #5 0x7fe42b494ac2 in start_thread nptl/pthread_create.c:442
    #6 0x7fe42b526a3f  (/lib/x86_64-linux-gnu/libc.so.6+0x126a3f)

0x62d000fc6580 is located 0 bytes to the right of 33152-byte region [0x62d000fbe400,0x62d000fc6580)
allocated by thread T0 here:
    #0 0x7fe42ccb61e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5620bf14270a in __gnu_cxx::new_allocator<unsigned char>::allocate(unsigned long, void const*) /usr/include/c++/11/ext/new_allocator.h:127
    #2 0x5620bf11ee72 in std::allocator_traits<std::allocator<unsigned char> >::allocate(std::allocator<unsigned char>&, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:464
    #3 0x5620bf0ea3eb in std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate(unsigned long) /usr/include/c++/11/bits/stl_vector.h:346
    #4 0x5620bf0a3ffb in std::vector<unsigned char, std::allocator<unsigned char> >::_M_default_append(unsigned long) /usr/include/c++/11/bits/vector.tcc:635
    #5 0x5620bf06d1ab in std::vector<unsigned char, std::allocator<unsigned char> >::resize(unsigned long) /usr/include/c++/11/bits/stl_vector.h:940
    #6 0x5620bef398d0 in ggml_graph_compute_helper /home/ggerganov/development/github/llama.cpp/llama.cpp:668
    #7 0x5620bef8f6b2 in llama_decode_internal /home/ggerganov/development/github/llama.cpp/llama.cpp:5577
    #8 0x5620befc9a09 in llama_decode /home/ggerganov/development/github/llama.cpp/llama.cpp:9462
    #9 0x5620bedd4eb5 in llama_init_from_gpt_params(gpt_params&) /home/ggerganov/development/github/llama.cpp/common/common.cpp:996
    #10 0x5620bed77fc5 in main /home/ggerganov/development/github/llama.cpp/examples/main/main.cpp:187
    #11 0x7fe42b429d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Thread T28 created by T0 here:
    #0 0x7fe42cc58685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x5620bf282b56 in ggml_graph_compute /home/ggerganov/development/github/llama.cpp/ggml.c:16094
    #2 0x5620bef3994f in ggml_graph_compute_helper /home/ggerganov/development/github/llama.cpp/llama.cpp:672
    #3 0x5620bef8f6b2 in llama_decode_internal /home/ggerganov/development/github/llama.cpp/llama.cpp:5577
    #4 0x5620befc9a09 in llama_decode /home/ggerganov/development/github/llama.cpp/llama.cpp:9462
    #5 0x5620bed8b2fa in main /home/ggerganov/development/github/llama.cpp/examples/main/main.cpp:605
    #6 0x7fe42b429d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ggerganov/development/github/llama.cpp/ggml.c:1158 in ggml_vec_cpy_f32
Shadow bytes around the buggy address:
  0x0c5a801f0c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0c90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c5a801f0ca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c5a801f0cb0:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0cc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0cd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0ce0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0cf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c5a801f0d00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==364805==ABORTING
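For illustration only, here is a minimal C sketch (not the actual ggml code; all names are hypothetical) of the class of bug ASan is reporting: a heap buffer sized for 8288 floats (33152 bytes, as in the report) while the row copy writes one element more, so the first out-of-bounds write of size 4 lands exactly 0 bytes to the right of the region:

#include <stdlib.h>

// stand-in for ggml_vec_cpy_f32
static void vec_cpy_f32(const int n, float * y, const float * x) {
    for (int i = 0; i < n; ++i) y[i] = x[i];
}

int main(void) {
    const int nc_alloc = 8288; // 8288 * sizeof(float) = 33152 bytes, as in the report
    const int nc_used  = 8289; // the copy assumes a row one element longer
    float * src   = calloc(nc_used,  sizeof(float));
    float * wdata = malloc(nc_alloc * sizeof(float)); // hypothetical scratch buffer
    vec_cpy_f32(nc_used, wdata, src); // ASan: heap-buffer-overflow, WRITE of size 4
    free(wdata);
    free(src);
    return 0;
}

Built with -fsanitize=address, this aborts with the same kind of report as above.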

@ggerganov
Owner

Please confirm that #4307 works

@LostRuins
Collaborator Author

Sorry I couldn't help more with the debugging. Anyway, #4307 seems to work for me: the segfault no longer occurs.

@ggerganov
Owner

No problem - thank you very much for reporting this issue
