Unable to use Ollama on Intel Arc B580 #12652

Open
x1tan opened this issue Jan 5, 2025 · 1 comment

x1tan commented Jan 5, 2025

I'm currently trying to get my B580 working with Ollama in Docker/Podman. I'm using the latest intelanalytics/ipex-llm-inference-cpp-xpu image on a Fedora 41 host (CPU: AMD Ryzen 5 5600, RAM: 32 GB); a sketch of the container invocation follows the host details below.

$ uname -a
Linux lab 6.12.7-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Dec 27 17:05:33 UTC 2024 x86_64 GNU/Linux
$ lspci -k
0a:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics]
	Subsystem: Intel Corporation Device 1100
	Kernel driver in use: xe
	Kernel modules: xe
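
For reference, the container is started roughly as sketched below. The volume path, container name, and shm size are illustrative placeholders rather than my exact command; the relevant parts are that /dev/dri is passed through so the container can see the GPU, and that the model directory is mounted at /models (matching OLLAMA_MODELS:/models in the server config logged below).

# Illustrative container start (paths/flags are placeholders, not my exact command)
podman run -itd \
    --net=host \
    --device=/dev/dri \
    -v /path/to/models:/models \
    --shm-size=16g \
    --name=ipex-llm-ollama \
    intelanalytics/ipex-llm-inference-cpp-xpu:latest

Ollama is then started inside the container with ./ollama serve, as shown in the logs below.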

If I set OLLAMA_NUM_GPU=999 as documented (tested with mistral:7b and qwen2.5:14b), I get a SYCL error:

root@5fb1794a72a1:/llm/ollama# ZES_ENABLE_SYSMAN=1 OLLAMA_NUM_GPU=999 ./ollama serve
2025/01/05 23:38:16 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-01-05T23:38:16.944+08:00 level=INFO source=images.go:753 msg="total blobs: 10"
time=2025-01-05T23:38:16.944+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-01-05T23:38:16.945+08:00 level=INFO source=routes.go:1248 msg="Listening on [::]:11434 (version 0.4.6-ipexllm-20250105)"
time=2025-01-05T23:38:16.945+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3587664585/runners
time=2025-01-05T23:38:16.985+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[ipex_llm]
[GIN] 2025/01/05 - 23:38:20 | 200 |      25.759µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/01/05 - 23:38:20 | 200 |    4.516226ms |       127.0.0.1 | POST     "/api/show"
time=2025-01-05T23:38:20.501+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2025-01-05T23:38:20.501+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:38:20.502+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:38:20.502+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:38:20.504+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:38:20.513+08:00 level=INFO source=server.go:105 msg="system memory" total="31.2 GiB" free="24.5 GiB" free_swap="8.0 GiB"
time=2025-01-05T23:38:20.513+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[24.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.5 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
time=2025-01-05T23:38:20.514+08:00 level=INFO source=server.go:401 msg="starting llama server" cmd="/tmp/ollama3587664585/runners/ipex_llm/ollama_llama_server --model /models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 --ctx-size 8192 --batch-size 512 --n-gpu-layers 999 --threads 6 --no-mmap --parallel 4 --port 46313"
time=2025-01-05T23:38:20.514+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T23:38:20.514+08:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-01-05T23:38:20.514+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T23:38:20.548+08:00 level=INFO source=runner.go:956 msg="starting go runner"
time=2025-01-05T23:38:20.548+08:00 level=INFO source=runner.go:957 msg=system info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=6
time=2025-01-05T23:38:20.549+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:46313"
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Mistral-7B-Instruct-v0.3
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.25 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.27 MiB
time=2025-01-05T23:38:20.766+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3850.02 MiB
llm_load_tensors:  SYCL_Host buffer size =    72.00 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|    1.6|    160|    1024|   32| 12168M|            1.3.31294|
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.56 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    96.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
time=2025-01-05T23:38:28.030+08:00 level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
could not create a primitive descriptor for a matmul primitive
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl.cpp, line:3226, func:operator()
SYCL error: CHECK_TRY_ERROR(op(ctx, src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i, dev[i].row_low, dev[i].row_high, src1_ncols, src1_padded_col_size, stream)): Meet error in this line code!
  in function ggml_sycl_op_mul_mat at /home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl.cpp:3226
/home/runner/_work/llm.cpp/llm.cpp/ollama-llama-cpp/ggml/src/ggml-sycl/common.hpp:107: SYCL error
libollama_ggml.so(+0x7d877)[0x7f3ca2a7d877]
libollama_ggml.so(ggml_abort+0xd8)[0x7f3ca2a7d808]
libollama_ggml.so(+0x200b98)[0x7f3ca2c00b98]
libollama_ggml.so(+0x237118)[0x7f3ca2c37118]
libollama_ggml.so(_Z25ggml_sycl_compute_forwardR25ggml_backend_sycl_contextP11ggml_tensor+0x5ef)[0x7f3ca2c036ff]
libollama_ggml.so(+0x24e12f)[0x7f3ca2c4e12f]
libollama_ggml.so(ggml_backend_sched_graph_compute_async+0x548)[0x7f3ca2aed698]
libollama_llama.so(llama_decode+0xb53)[0x7f3ca4a7dd53]
/tmp/ollama3587664585/runners/ipex_llm/ollama_llama_server(_cgo_0deba22bda5f_Cfunc_llama_decode+0x4c)[0x55d72f10f88c]
/tmp/ollama3587664585/runners/ipex_llm/ollama_llama_server(+0xf8b01)[0x55d72eef8b01]
SIGABRT: abort
PC=0x7f3ca22429fc m=4 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 6 gp=0xc000007dc0 m=4 mp=0xc00006d808 [syscall]:
runtime.cgocall(0x55d72f10f840, 0xc0000e3c50)
	runtime/cgocall.go:157 +0x4b fp=0xc0000e3c28 sp=0xc0000e3bf0 pc=0x55d72ee9046b
ollama/llama/llamafile._Cfunc_llama_decode(0x7f3c3eb6a0a0, {0x9, 0x7f3c3c126860, 0x0, 0x0, 0x7f3c3c126890, 0x7f3c3c1268c0, 0x7f3c3eaeec60, 0x7f3c3c00c6f0, 0x0, ...})
	_cgo_gotypes.go:548 +0x52 fp=0xc0000e3c50 sp=0xc0000e3c28 pc=0x55d72ef8d9f2
ollama/llama/llamafile.(*Context).Decode.func1(0x7f3c3c126890?, 0x7f3c3c1268c0?)
	ollama/llama/llamafile/llama.go:121 +0xd8 fp=0xc0000e3d70 sp=0xc0000e3c50 pc=0x55d72ef900b8
ollama/llama/llamafile.(*Context).Decode(0x0?, 0x0?)
	ollama/llama/llamafile/llama.go:121 +0x13 fp=0xc0000e3db8 sp=0xc0000e3d70 pc=0x55d72ef8ff53
main.(*Server).loadModel(0xc0000c0120, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc00003a1a0, 0x0}, ...)
	ollama/llama/runner/runner.go:905 +0x3bd fp=0xc0000e3f10 sp=0xc0000e3db8 pc=0x55d72f10d25d
main.main.gowrap1()
	ollama/llama/runner/runner.go:990 +0xda fp=0xc0000e3fe0 sp=0xc0000e3f10 pc=0x55d72f10e95a
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000e3fe8 sp=0xc0000e3fe0 pc=0x55d72eef8e81
created by main.main in goroutine 1
	ollama/llama/runner/runner.go:990 +0xc6c

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0xc000046008?, 0x0?, 0xc0?, 0x61?, 0xc00003f898?)
	runtime/proc.go:402 +0xce fp=0xc00003f860 sp=0xc00003f840 pc=0x55d72eec70ae
runtime.netpollblock(0xc00003f8f8?, 0x2ee8fbc6?, 0xd7?)
	runtime/netpoll.go:573 +0xf7 fp=0xc00003f898 sp=0xc00003f860 pc=0x55d72eebf2f7
internal/poll.runtime_pollWait(0x7f3ca5030020, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc00003f8b8 sp=0xc00003f898 pc=0x55d72eef3b45
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00003f8e0 sp=0xc00003f8b8 pc=0x55d72ef43a67
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000ee080)
	internal/poll/fd_unix.go:611 +0x2ac fp=0xc00003f988 sp=0xc00003f8e0 pc=0x55d72ef44f2c
net.(*netFD).accept(0xc0000ee080)
	net/fd_unix.go:172 +0x29 fp=0xc00003fa40 sp=0xc00003f988 pc=0x55d72efb3b49
net.(*TCPListener).accept(0xc00007e1e0)
	net/tcpsock_posix.go:159 +0x1e fp=0xc00003fa68 sp=0xc00003fa40 pc=0x55d72efc487e
net.(*TCPListener).Accept(0xc00007e1e0)
	net/tcpsock.go:327 +0x30 fp=0xc00003fa98 sp=0xc00003fa68 pc=0x55d72efc3bd0
net/http.(*onceCloseListener).Accept(0xc0000c01b0?)
	<autogenerated>:1 +0x24 fp=0xc00003fab0 sp=0xc00003fa98 pc=0x55d72f0eade4
net/http.(*Server).Serve(0xc0000f4000, {0x55d72f416560, 0xc00007e1e0})
	net/http/server.go:3260 +0x33e fp=0xc00003fbe0 sp=0xc00003fab0 pc=0x55d72f0e1bfe
main.main()
	ollama/llama/runner/runner.go:1015 +0x10cd fp=0xc00003ff50 sp=0xc00003fbe0 pc=0x55d72f10e5cd
runtime.main()
	runtime/proc.go:271 +0x29d fp=0xc00003ffe0 sp=0xc00003ff50 pc=0x55d72eec6c7d
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00003ffe8 sp=0xc00003ffe0 pc=0x55d72eef8e81

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000066fa8 sp=0xc000066f88 pc=0x55d72eec70ae
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.forcegchelper()
	runtime/proc.go:326 +0xb8 fp=0xc000066fe0 sp=0xc000066fa8 pc=0x55d72eec6f38
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000066fe8 sp=0xc000066fe0 pc=0x55d72eef8e81
created by runtime.init.6 in goroutine 1
	runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000067780 sp=0xc000067760 pc=0x55d72eec70ae
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.bgsweep(0xc000024230)
	runtime/mgcsweep.go:278 +0x94 fp=0xc0000677c8 sp=0xc000067780 pc=0x55d72eeb1bf4
runtime.gcenable.gowrap1()
	runtime/mgc.go:203 +0x25 fp=0xc0000677e0 sp=0xc0000677c8 pc=0x55d72eea6725
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000677e8 sp=0xc0000677e0 pc=0x55d72eef8e81
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000024230?, 0x55d72f18e208?, 0x1?, 0x0?, 0xc000007340?)
	runtime/proc.go:402 +0xce fp=0xc000067f78 sp=0xc000067f58 pc=0x55d72eec70ae
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.(*scavengerState).park(0x55d72f5e0680)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000067fa8 sp=0xc000067f78 pc=0x55d72eeaf5e9
runtime.bgscavenge(0xc000024230)
	runtime/mgcscavenge.go:653 +0x3c fp=0xc000067fc8 sp=0xc000067fa8 pc=0x55d72eeafb7c
runtime.gcenable.gowrap2()
	runtime/mgc.go:204 +0x25 fp=0xc000067fe0 sp=0xc000067fc8 pc=0x55d72eea66c5
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x55d72eef8e81
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc000066648?, 0x55d72ee9a025?, 0xa8?, 0x1?, 0xc0000061c0?)
	runtime/proc.go:402 +0xce fp=0xc000066620 sp=0xc000066600 pc=0x55d72eec70ae
runtime.runfinq()
	runtime/mfinal.go:194 +0x107 fp=0xc0000667e0 sp=0xc000066620 pc=0x55d72eea5767
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000667e8 sp=0xc0000667e0 pc=0x55d72eef8e81
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:164 +0x3d

goroutine 7 gp=0xc0000f2000 m=nil [semacquire]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:402 +0xce fp=0xc000068e08 sp=0xc000068de8 pc=0x55d72eec70ae
runtime.goparkunlock(...)
	runtime/proc.go:408
runtime.semacquire1(0xc0000c0128, 0x0, 0x1, 0x0, 0x12)
	runtime/sema.go:160 +0x22c fp=0xc000068e70 sp=0xc000068e08 pc=0x55d72eed94cc
sync.runtime_Semacquire(0x0?)
	runtime/sema.go:62 +0x25 fp=0xc000068ea8 sp=0xc000068e70 pc=0x55d72eef5305
sync.(*WaitGroup).Wait(0x0?)
	sync/waitgroup.go:116 +0x48 fp=0xc000068ed0 sp=0xc000068ea8 pc=0x55d72ef13d88
main.(*Server).run(0xc0000c0120, {0x55d72f416ba0, 0xc00009a0a0})
	ollama/llama/runner/runner.go:315 +0x47 fp=0xc000068fb8 sp=0xc000068ed0 pc=0x55d72f109627
main.main.gowrap2()
	ollama/llama/runner/runner.go:995 +0x28 fp=0xc000068fe0 sp=0xc000068fb8 pc=0x55d72f10e848
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000068fe8 sp=0xc000068fe0 pc=0x55d72eef8e81
created by main.main in goroutine 1
	ollama/llama/runner/runner.go:995 +0xd3e

goroutine 8 gp=0xc0000f21c0 m=nil [IO wait]:
runtime.gopark(0x94?, 0xc0000e7958?, 0x40?, 0x79?, 0xb?)
	runtime/proc.go:402 +0xce fp=0xc0000e7910 sp=0xc0000e78f0 pc=0x55d72eec70ae
runtime.netpollblock(0x55d72ef2d5f8?, 0x2ee8fbc6?, 0xd7?)
	runtime/netpoll.go:573 +0xf7 fp=0xc0000e7948 sp=0xc0000e7910 pc=0x55d72eebf2f7
internal/poll.runtime_pollWait(0x7f3ca502ff28, 0x72)
	runtime/netpoll.go:345 +0x85 fp=0xc0000e7968 sp=0xc0000e7948 pc=0x55d72eef3b45
internal/poll.(*pollDesc).wait(0xc0000ee100?, 0xc0000f6000?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0000e7990 sp=0xc0000e7968 pc=0x55d72ef43a67
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000ee100, {0xc0000f6000, 0x1000, 0x1000})
	internal/poll/fd_unix.go:164 +0x27a fp=0xc0000e7a28 sp=0xc0000e7990 pc=0x55d72ef445ba
net.(*netFD).Read(0xc0000ee100, {0xc0000f6000?, 0xc0000e7a98?, 0x55d72ef43f25?})
	net/fd_posix.go:55 +0x25 fp=0xc0000e7a70 sp=0xc0000e7a28 pc=0x55d72efb2a45
net.(*conn).Read(0xc00006a098, {0xc0000f6000?, 0x0?, 0xc0000a6ed8?})
	net/net.go:185 +0x45 fp=0xc0000e7ab8 sp=0xc0000e7a70 pc=0x55d72efbcd05
net.(*TCPConn).Read(0xc0000a6ed0?, {0xc0000f6000?, 0xc0000ee100?, 0xc0000e7af0?})
	<autogenerated>:1 +0x25 fp=0xc0000e7ae8 sp=0xc0000e7ab8 pc=0x55d72efc86e5
net/http.(*connReader).Read(0xc0000a6ed0, {0xc0000f6000, 0x1000, 0x1000})
	net/http/server.go:789 +0x14b fp=0xc0000e7b38 sp=0xc0000e7ae8 pc=0x55d72f0d7a0b
bufio.(*Reader).fill(0xc000044480)
	bufio/bufio.go:110 +0x103 fp=0xc0000e7b70 sp=0xc0000e7b38 pc=0x55d72f094303
bufio.(*Reader).Peek(0xc000044480, 0x4)
	bufio/bufio.go:148 +0x53 fp=0xc0000e7b90 sp=0xc0000e7b70 pc=0x55d72f094433
net/http.(*conn).serve(0xc0000c01b0, {0x55d72f416b68, 0xc0000a6db0})
	net/http/server.go:2079 +0x749 fp=0xc0000e7fb8 sp=0xc0000e7b90 pc=0x55d72f0dd769
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3290 +0x28 fp=0xc0000e7fe0 sp=0xc0000e7fb8 pc=0x55d72f0e1fe8
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc0000e7fe8 sp=0xc0000e7fe0 pc=0x55d72eef8e81
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3290 +0x4b4

rax    0x0
rbx    0x7f3c467fd640
rcx    0x7f3ca22429fc
rdx    0x6
rdi    0x2bb
rsi    0x2be
rbp    0x2be
rsp    0x7f3c467fa430
r8     0x7f3c467fa500
r9     0x5
r10    0x8
r11    0x246
r12    0x6
r13    0x16
r14    0x7f3ca434e260
r15    0x7f3ca23c6860
rip    0x7f3ca22429fc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-01-05T23:38:28.288+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:CHECK_TRY_ERROR(op(ctx, src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i, dev[i].row_low, dev[i].row_high, src1_ncols, src1_padded_col_size, stream)): Meet error in this line code!"

As suggested here, I also tried OLLAMA_NUM_GPU=1, which results in an illegal instruction error (although the CPU does support AVX/AVX2):

root@5fb1794a72a1:/llm/ollama# ZES_ENABLE_SYSMAN=1 OLLAMA_NUM_GPU=1 ./ollama serve
2025/01/05 23:36:29 routes.go:1197: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-01-05T23:36:29.576+08:00 level=INFO source=images.go:753 msg="total blobs: 10"
time=2025-01-05T23:36:29.576+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-01-05T23:36:29.577+08:00 level=INFO source=routes.go:1248 msg="Listening on [::]:11434 (version 0.4.6-ipexllm-20250105)"
time=2025-01-05T23:36:29.577+08:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama357000223/runners
time=2025-01-05T23:36:29.617+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners=[ipex_llm]
[GIN] 2025/01/05 - 23:36:39 | 200 |       26.01µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/01/05 - 23:36:39 | 200 |    4.596428ms |       127.0.0.1 | POST     "/api/show"
time=2025-01-05T23:36:39.040+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2025-01-05T23:36:39.041+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:36:39.041+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:36:39.041+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:36:39.043+08:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2025-01-05T23:36:39.052+08:00 level=INFO source=server.go:105 msg="system memory" total="31.2 GiB" free="24.8 GiB" free_swap="8.0 GiB"
time=2025-01-05T23:36:39.053+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[24.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.5 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
time=2025-01-05T23:36:39.057+08:00 level=INFO source=server.go:401 msg="starting llama server" cmd="/tmp/ollama357000223/runners/ipex_llm/ollama_llama_server --model /models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 --ctx-size 8192 --batch-size 512 --n-gpu-layers 1 --threads 6 --no-mmap --parallel 4 --port 35309"
time=2025-01-05T23:36:39.057+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-05T23:36:39.057+08:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-01-05T23:36:39.058+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T23:36:39.092+08:00 level=INFO source=runner.go:956 msg="starting go runner"
time=2025-01-05T23:36:39.092+08:00 level=INFO source=runner.go:957 msg=system info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=6
time=2025-01-05T23:36:39.092+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:35309"
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Mistral-7B-Instruct-v0.3
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.25 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.27 MiB
time=2025-01-05T23:36:39.308+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =   117.03 MiB
llm_load_tensors:  SYCL_Host buffer size =  3804.98 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe20b]|    1.6|    160|    1024|   32| 12168M|            1.3.31294|
llama_kv_cache_init:      SYCL0 KV buffer size =    32.00 MiB
llama_kv_cache_init:  SYCL_Host KV buffer size =   992.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.56 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =   552.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 190
time=2025-01-05T23:36:41.274+08:00 level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-01-05T23:36:41.515+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server not responding"
time=2025-01-05T23:36:42.853+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-01-05T23:36:43.104+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: signal: illegal instruction (core dumped)"

I've also tried various combinations of other parameters (e.g., OLLAMA_NUM_PARALLEL=1), but without any luck. I always get one of these two errors, depending on the value of OLLAMA_NUM_GPU.
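
Both the CPU flags and the GPU do show up in the runner logs above (system info reports AVX = 1 | AVX2 = 1, and one level_zero GPU is enumerated). For reference, a quick manual re-check inside the container looks roughly like this (a sketch; sycl-ls is assumed to be on PATH from the image's oneAPI runtime):

# CPU: confirm AVX/AVX2 are advertised (relevant to the illegal-instruction crash)
grep -m1 -o -w -e avx -e avx2 /proc/cpuinfo

# GPU: confirm the render node is mapped into the container
ls -l /dev/dri

# GPU: confirm the SYCL runtime enumerates the B580 as a Level Zero device
sycl-ls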

@ACupofAir (Contributor) commented:

We have updated the Docker image and verified that this problem has been solved. Please run docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest to update the image and try again.
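
Note that an existing container created from the old image will keep using it, so recreate the container after pulling. Roughly (the container name below is a placeholder for whatever you named yours):

docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
docker rm -f ipex-llm-ollama     # remove the container created from the old image
# then recreate the container with your previous docker/podman run flags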
