ipex-llm Llama.cpp port inside ipex-llm Docker containers getting SIGBUS #10955

Closed
simonlui opened this issue May 7, 2024 · 4 comments

simonlui commented May 7, 2024

This might be compute-runtime or kernel related, but I am posting here first since I don't know. For the simplest reproduction, I pulled the recently published intelanalytics/ipex-llm-xpu:cpp-test Docker image, though I had previously been using another Docker container to run the llama.cpp fork included in the bigdl-core-cpp pip package and saw the same error there. I ran the same command as in the Quickstart guide and got the output below.
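For reference, the invocation was along these lines; the container name, model path, and prompt are illustrative stand-ins rather than the exact ones I used:

# start the published image with the GPU passed through (mounts/names illustrative)
sudo docker run -itd --net=host --device=/dev/dri \
    -v /path/to/models:/models --name=ipex-llm-cpp-test \
    intelanalytics/ipex-llm-xpu:cpp-test
# inside the container, run the bundled llama.cpp main against a GGUF model,
# offloading all 33 layers to the GPU
./main -m /models/llama-2-7b-chat.Q4_K_M.gguf -p "Once upon a time" -n 32 -e -ngl 33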

...
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  7605.33 MiB
llm_load_tensors:  SYCL_Host buffer size =   532.31 MiB
.
Thread 1 "main" received signal SIGBUS, Bus error.
...
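To capture a backtrace I re-ran the same command under gdb, roughly like this (arguments illustrative as above):

gdb --args ./main -m /models/llama-2-7b-chat.Q4_K_M.gguf -p "Once upon a time" -n 32 -e -ngl 33
# then `run`, and `bt` once the SIGBUS is raised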

The GDB backtrace shows the following.

Thread 1 "main" received signal SIGBUS, Bus error.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:708
708	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:708
#1  0x00007f72618ef82b in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007f72618fe3f8 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3  0x00007f7261824f14 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4  0x00007f727ef25bc4 in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libpi_level_zero.so
#5  0x00007f727ef2bf48 in urEnqueueUSMMemcpy () from /opt/intel/oneapi/compiler/2024.0/lib/libpi_level_zero.so
#6  0x00007f727ef4ed9b in piextUSMEnqueueMemcpy () from /opt/intel/oneapi/compiler/2024.0/lib/libpi_level_zero.so
#7  0x00007f727fcb452f in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)97, _pi_queue*, unsigned int, void*, void const*, unsigned long, unsigned long, _pi_event**, _pi_event**>(_pi_queue*, unsigned int, void*, void const*, unsigned long, unsigned long, _pi_event**, _pi_event**) const () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#8  0x00007f727fca854f in sycl::_V1::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned long, void*, std::vector<_pi_event*, std::allocator<_pi_event*> >, _pi_event**, std::shared_ptr<sycl::_V1::detail::event_impl> const&) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#9  0x00007f727fcfc8f2 in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#10 0x00007f727fda5146 in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#11 0x00000000006b334e in ggml_backend_sycl_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) ()
#12 0x00000000005017ba in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) ()
#13 0x00000000004ac199 in llama_model_load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model&, llama_model_params&) ()
#14 0x00000000004a8d91 in llama_load_model_from_file ()
#15 0x0000000000440456 in llama_init_from_gpt_params(gpt_params&) ()
#16 0x000000000042a8ce in main ()

If I run with SYCL_PI_TRACE=-1, this is the last snippet I see before the SIGBUS:

...
---> piextUSMEnqueueMemcpy(
	<unknown> : 0x8e294c0
	<unknown> : 0
	<unknown> : 0xffffd556cbe54000
	<unknown> : 0x8e91270
	<unknown> : 16384
	<unknown> : 0
	pi_event * : 0[ nullptr ]
	pi_event * : 0x8e8ede8[ 0 ... ]
UR ---> TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, UseCopyEngine)
UR <--- TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, UseCopyEngine)(UR_RESULT_SUCCESS)
UR ---> Queue->Context->getAvailableCommandList(Queue, CommandList, UseCopyEngine, OkToBatch)
UR ---> Queue->insertStartBarrierIfDiscardEventsMode(CommandList)
UR <--- Queue->insertStartBarrierIfDiscardEventsMode(CommandList)(UR_RESULT_SUCCESS)
UR <--- Queue->Context->getAvailableCommandList(Queue, CommandList, UseCopyEngine, OkToBatch)(UR_RESULT_SUCCESS)
UR ---> createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal)
UR ---> EventCreate(Queue->Context, Queue, HostVisible.value(), Event)
UR <--- EventCreate(Queue->Context, Queue, HostVisible.value(), Event)(UR_RESULT_SUCCESS)
UR ---> urEventRetain(*Event)
UR <--- urEventRetain(*Event)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal)(UR_RESULT_SUCCESS)

I am using Linux kernel 6.8.8, and I was under the impression that the earlier kernel and compute-runtime issues had been fixed, given that sycl-ls lists the GPU correctly and the kernel no longer hangs under this workload. I hope this is enough information to track down the issue, but I can provide full logs upon request.
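(By "sycl-ls lists the GPU correctly" I mean a quick check like the one below inside the container; the exact device string will differ per system.)

sycl-ls
# expect a Level Zero GPU entry, e.g. a line beginning with "[ext_oneapi_level_zero:gpu:0]"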


hzjane commented May 8, 2024

This image is still under internal testing; we will update you with the latest image once development is complete.


simonlui commented May 8, 2024

I understand that, but I am getting the same problem regardless of whether I use this image or my custom Docker container running the llama.cpp fork inside bigdl-core-cpp. Is bigdl-core-cpp, or at least this pre-production version, not usable, given that it still goes by the bigdl name even though the project has been renamed to ipex-llm?


hzjane commented May 8, 2024

Maybe this issue is caused by a newer Linux kernel version. We have validated kernel versions 5.19.0-41-generic and 6.2.0, but not 6.8.8.


simonlui commented May 8, 2024

I found what the problem was. I checked a few other issues, and one of the troubleshooting steps was to run the utility scripts in ipex-llm. When I did, I noticed in the lspci output that the GPU's addressable memory was limited to 256MB, which meant Resizable BAR (ReBAR) was disabled on my system. It turned out I had forgotten to disable CSM after enabling it while troubleshooting something unrelated the other day. Re-enabling ReBAR fixed the SIGBUS and allowed the llama.cpp fork to proceed as normal after a prolonged warmup. I am not sure whether the utility script could be modified to detect whether ReBAR is enabled, but that may be worth adding to help diagnose issues like this.
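For example, a rough check along these lines might work; the device matching is only a sketch, not what the script actually does:

# with ReBAR / Above 4G Decoding enabled, the GPU's prefetchable BAR should be
# roughly VRAM-sized (e.g. 8G or 16G); a 256M cap means ReBAR is effectively off
lspci -v | grep -A10 -i 'VGA\|Display controller' | grep -i 'prefetchable'
# a result like "Memory at ... (64-bit, prefetchable) [size=256M]" -> ReBAR disabled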
There was an unrelated issue where the application got stuck at 100% on one core with the compute runtime on kernel 6.8.5 or higher, but I have reverted to kernel 6.8.4 for now until upstream figures out the issue and how to mitigate it without losing performance, and everything now works fine with the fork. Thanks!

simonlui closed this as completed May 8, 2024