Replies: 22 comments 80 replies
-
I'd like to handle: Multi-card Support, and the CI test error when more than one GPU is detected and used.
-
For code cleanup and sanitization of the compiler runtime, I will be adding a patch after the previous changes land.
-
Would like to see support for SOTA 2-bit quantized models (GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ3_XXS). I've been trying to do this myself for the past hour or two; using dpct isn't as trouble-free as it is made out to be.
-
Before airMeng:sycl_fix_max_alloc_size (I don't know if these numbers can be considered valid when the model produces useless output):
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device

After:

Vulkan, latest:
-
I wonder why the build fails; I think this GPU supports f16. Anyway, I'm getting 16 tokens/sec for Q4_K_M on a 7B model with all layers on the GPU. With batch bench, Mistral-7B Q4_K_M:
-
I would like to know if there are plans to support quantization types that are not currently supported, like IQ3 and IQ4?
-
With the recent changes, the model nous-hermes-2-34b-2.69 has seen a significant speedup: from an unusable 3-4 tokens per second, it now reaches 7-8.

detect 1 SYCL GPUs: [0,] with Max compute units:512
build: 21b0867 (2345)

For comparison, Vulkan:
Vulkan0: Intel(R) Arc(TM) A770 Graphics | uma: 0 | fp16: 1 | warp size: 32
build: 82cb31e (2348)

I hope other quantization methods will also see an improvement; for now, they perform more or less similarly to Vulkan.
-
I have a question. During prompt processing or generation, llama.cpp's SYCL backend seems to use only one of the engines of my GPU (I am assuming XMX), although they are tagged 'unknown' in intel_gpu_top. Does anyone know why? And is it possible to parallelize across all of them? On Linux, I can't even monitor the VRAM usage or the temperatures, which is surprising given how long it has been since this GPU launched. All my hopes are on the new Xe driver.
-
@akarshanbiswas can you try https://github.com/intel/xpumanager to monitor the usage?
-
Adding this here; it may add to the list of todos and fixes. Credit goes to Gemini 1.5 Pro 1 million. :)
Will update when I find anything else, with my limited knowledge.
-
The SYCL backend should be updated to adopt these changes in ggml-backend:
-
Just an update here: I did not use llama.cpp for a few days because I was busy. I ran it today to test Llama 3 and found that it hangs every time, with every model, right here:
I am running with --no-mmap. In the logs I found:
Not sure if this is because of an update that I received on Arch Linux, but not a single model runs with the same binary that used to run fine before.
Update: Not related to llama.cpp; JAX with intel-extension-for-openxla hangs too (now confirmed).
Update 2: Came across this: intel/compute-runtime#497
-
Latest SYCL is broken due to #7640 (comment). I am looking into it and hope to fix it soon.
-
Hi, I'm trying to compile the SYCL backend, but the compiler seems to almost get stuck. I do have an older, slower CPU (i5-8350U).
-
Is performance supposed to get worse when quantizing the key cache to either q8 or q4? Normally f16 gives 20 tokens/sec, while q8 gives 5 tokens/sec on my Intel Arc.
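For reference, a minimal sketch of how a quantized key cache is requested through the usual llama.h context parameters; the helper name is made up and this assumes a build where type_k/type_v are exposed:

```cpp
// Hypothetical sketch: create a context with a q8_0 key cache and an f16 value cache.
#include "llama.h"

static llama_context * make_ctx_with_q8_k_cache(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q8_0;  // quantized key cache
    cparams.type_v = GGML_TYPE_F16;   // value cache kept in f16
    return llama_new_context_with_model(model, cparams);
}
```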
-
Hello - a heads up from the Codeplay side: we found that recent changes (#6408) introduced a lot of expensive device info queries:

const int work_group_size = get_work_group_size(stream->get_device());

Unfortunately these queries aren't cached, and due to the way the DPCT headers are currently designed, this actually makes 15 different device info queries each time it's called 🫠. @OuadiElfarouki from our side is working on a caching mechanism which should fix this. On Nvidia hardware we found this created a significant performance drop (from ~10 T/s down to ~2 T/s). Like I say, we are working on a fix; I am just posting for info.

@airMeng @NeoZhangJianyu Maybe these queries are cheap on Intel drivers, but I am seeing a lot of discussion above about performance regressions. Could it be related? @zhentaoyu
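For illustration only, a minimal sketch of per-device caching along those lines; the function name and data structure are hypothetical and not the actual patch:

```cpp
// Hypothetical sketch: query the work-group size once per device and cache it,
// instead of hitting the driver on every call.
#include <sycl/sycl.hpp>
#include <mutex>
#include <unordered_map>

static int cached_work_group_size(const sycl::device & dev) {
    static std::mutex mtx;
    static std::unordered_map<sycl::device, int> cache;  // std::hash<sycl::device> is provided by SYCL 2020

    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(dev);
    if (it == cache.end()) {
        // First call for this device: do the (expensive) device info query once.
        it = cache.emplace(dev, (int) dev.get_info<sycl::info::device::max_work_group_size>()).first;
    }
    return it->second;
}
```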
-
A bug that I introduced with #8644 reveals that we don't have any CI that tests static builds. I suggested a couple of approaches here; what do you think is best?
-
Hello, here is a feature request: can the SYCL build be supported in llama.cpp's Nix flakes? I want to use it on NixOS but struggle to package oneAPI.
-
I just found out that the SYCL backend doesn't use
This is confirmed by the comment here:
which means the backend can get much faster with a proper implementation on discrete GPUs. It is not even utilizing the hardware properly! I'll need to forget about flash attention (the issue was closed for being stale). We have many examples here, and I am not sure how to implement this.
-
Thank you SYCL crew, all the work is appreciated.
-
Hi, I have some questions. We can simply call the functions with the required parameters, which should be better, right?
-
Feel free to drop a note; let us know if you have any feature requests or bugs (even unconfirmed).
The current code returns all SYCL devices, including CPU, GPU (Level Zero, OpenCL) and FPGA. The SYCL backend only supports GPU, so when CI tests run on other devices they will fail.
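For illustration, a minimal sketch (hypothetical helper name, not the actual backend code) of enumerating GPU devices only with the standard SYCL API:

```cpp
// Hypothetical sketch: enumerate only GPU devices instead of every SYCL device
// (CPU, FPGA, ...), so runs on non-GPU machines are not picked up by mistake.
#include <sycl/sycl.hpp>
#include <vector>

static std::vector<sycl::device> sycl_backend_gpu_devices() {
    std::vector<sycl::device> result;
    for (const auto & platform : sycl::platform::get_platforms()) {
        // Ask each platform for its GPU devices only.
        for (const auto & dev : platform.get_devices(sycl::info::device_type::gpu)) {
            result.push_back(dev);
        }
    }
    return result;
}
```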
There is a known SYCL issue: memcpy() from host (mmap) to device will hang in some cases. It's not resolved yet. A workaround is to not use mmap; I have handled it in llama-bench (by adding an --mmap parameter). We need to add it to more of the applications in examples.
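As a sketch of that workaround in other examples, assuming the usual llama.h model-loading API (the helper name is made up):

```cpp
// Hypothetical sketch: load a model with mmap disabled to avoid the
// host (mmap) -> device memcpy hang described above.
#include "llama.h"

static llama_model * load_model_without_mmap(const char * path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_mmap = false;  // read the model into regular host memory instead
    return llama_load_model_from_file(path, mparams);
}
```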
I suggest handling it after multi-card support is finished; a lot of that currently unused code will be useful for the multi-card feature.
Also, let us know if you have taken any tasks here.
cc @NeoZhangJianyu @luoyu-intel @abhilash1910