IPEX-LLM (llama.cpp) core dump when running Qwen-7B-Q4_K_M.gguf on Intel ARC770 #11260
Re-posting the part that was unclear above: Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp, line:15299, func:operator()
Hi @jianweimama, I was unable to reproduce the error on our machine. Please follow this guide to reinstall oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1 (the "Intel® oneAPI Base Toolkit 2024.0 installation methods" part), and try again. Here is my log:
[log output not captured in this excerpt]
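(For anyone else landing here: a quick way to confirm which oneAPI toolkit is actually active after the reinstall. icpx and sycl-ls ship with the Base Toolkit; expecting a 2024.0.x version string is an assumption based on the recommendation above.)
#source /opt/intel/oneapi/setvars.sh
#icpx --version
#sycl-ls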
Updated the GPU driver to the recommended I915_24.1.11_PSB_240117.14; it works now.
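For reference, the i915 driver version in use can be checked with standard kernel tools before and after the update (generic commands, not specific to the I915_24.1.11_PSB_240117.14 package):
#modinfo i915 | grep -i version
#dmesg | grep -i i915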
The IPEX-LLM llama.cpp setup steps are as follows:
1.Install oneAPI
#wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
#echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
#sudo apt update
#sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
    intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
    intel-oneapi-diagnostics-utility=2024.0.0-49093 \
    intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
    intel-oneapi-dpcpp-ct=2024.0.0-49381 \
    intel-oneapi-mkl=2024.0.0-49656 \
    intel-oneapi-mkl-devel=2024.0.0-49656 \
    intel-oneapi-mpi=2021.11.0-49493 \
    intel-oneapi-mpi-devel=2021.11.0-49493 \
    intel-oneapi-dal=2024.0.1-25 \
    intel-oneapi-dal-devel=2024.0.1-25 \
    intel-oneapi-ippcp=2021.9.1-5 \
    intel-oneapi-ippcp-devel=2021.9.1-5 \
    intel-oneapi-ipp=2021.10.1-13 \
    intel-oneapi-ipp-devel=2021.10.1-13 \
    intel-oneapi-tlt=2024.0.0-352 \
    intel-oneapi-ccl=2021.11.2-5 \
    intel-oneapi-ccl-devel=2021.11.2-5 \
    intel-oneapi-dnnl-devel=2024.0.0-49521 \
    intel-oneapi-dnnl=2024.0.0-49521 \
    intel-oneapi-tcm-1.0=1.0.0-435
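To verify that the pinned packages were actually installed at these versions, apt can list them back (standard apt usage; the grep pattern is just a convenience):
#apt list --installed 2>/dev/null | grep intel-oneapi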
2.Install Miniforge
#wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
#bash Miniforge3-Linux-x86_64.sh
3.Create conda environment and install ipex-llm
#conda create -n llm python=3.11
#conda activate llm
#pip install --pre --upgrade ipex-llm[cpp]
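Before moving on, a minimal check that the wheel landed in the active llm env (ipex_llm is the module the package installs; pip show is standard):
#pip show ipex-llm
#python -c "import ipex_llm"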
4.Setup for running llama.cpp
#mkdir llama-cpp
#cd llama-cpp
(llm) llama-cpp# init-llama-cpp
(llm) llama-cpp# ls
baby-llama beam-search convert-llama2c-to-ggml export-lora gguf-py infill lookahead main perplexity quantize-stats simple train-text-from-scratch
batched benchmark convert.py finetune gritlm llama-bench lookup parallel q8dot save-load-state speculative vdot
batched-bench convert-hf-to-gguf.py embedding gguf imatrix llava-cli ls-sycl-device passkey quantize server tokenize
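init-llama-cpp links the IPEX-LLM build of the llama.cpp executables into the current directory, so the main used below is the SYCL-enabled binary rather than a locally compiled one. A quick confirmation with standard coreutils:
(llm) llama-cpp# readlink -f main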
5.Set runtime environment
#source /opt/intel/oneapi/setvars.sh
#export SYCL_CACHE_PERSISTENT=1
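SYCL_CACHE_PERSISTENT=1 enables the persistent kernel cache, so the JIT compilation cost is only paid on the first run. When more than one GPU (or an iGPU) is present, pinning the run to a single device can also rule out device-selection issues; ONEAPI_DEVICE_SELECTOR is a standard oneAPI runtime variable, and the index 0 below is an assumption for a single ARC770:
#export ONEAPI_DEVICE_SELECTOR=level_zero:0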
6.Run the quantized model
(llm) llama-cpp# ./main -m Qwen-7B-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
Log start
main: build = 1 (9140e0f)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed = 1717580760
llama_model_loader: loaded meta data with 20 key-value pairs and 259 tensors from Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen
llama_model_loader: - kv 1: general.name str = Qwen
llama_model_loader: - kv 2: qwen.context_length u32 = 8192
llama_model_loader: - kv 3: qwen.block_count u32 = 32
llama_model_loader: - kv 4: qwen.embedding_length u32 = 4096
llama_model_loader: - kv 5: qwen.feed_forward_length u32 = 22016
llama_model_loader: - kv 6: qwen.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 7: qwen.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: qwen.attention.head_count u32 = 32
llama_model_loader: - kv 9: qwen.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 151643
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q4_K: 113 tensors
llama_model_loader: - type q5_K: 32 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.72 B
llm_load_print_meta: model size = 4.56 GiB (5.07 BPW)
llm_load_print_meta: general.name = Qwen
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
[SYCL device table truncated in the captured log; output jumps to the gdb backtrace below]
line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007d34e78ea42f in __GI___wait4 (pid=11456, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007d34e78ea42f in __GI___wait4 (pid=11456, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000000000635d16 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2 0x0000000000631737 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3 0x00000000006f599f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4 0x00000000005e5698 in ggml_backend_sched_graph_compute_async ()
#5 0x00000000004e7f0c in llama_decode ()
#6 0x000000000044cc0c in llama_init_from_gpt_params(gpt_params&) ()
#7 0x000000000043670e in main ()
[Inferior 1 (process 11321) detached]
Aborted (core dumped)
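For anyone trying to capture the same backtrace on their own machine, a hedged recipe using standard gdb tooling (the paths and core file name are assumptions; some distros route core dumps through systemd-coredump instead):
#ulimit -c unlimited
#./main -m Qwen-7B-Q4_K_M.gguf -n 32 -ngl 33 --prompt "..."
#gdb ./main core
(gdb) bt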