Bug: llama.cpp with Vulkan not running on Snapdragon X + Windows (Copilot+PCs) #8455
Comments
I think that's the same bug that happens on Snapdragon phones: #5186 Some shader compiler bug in the Adreno driver. Might be good to report it to Qualcomm.
@0cc4m thanks a lot. I will look into it. I also tried to run it on WSL2 with its Microsoft CPU-emulated Vulkan driver. There it does not crash, but the results generated are garbage (and very slow). I have tried to reach out to Qualcomm and will see if they answer.
Have you experimented with https://apps.microsoft.com/detail/9nqpsl29bfff?hl=en-US&gl=US, which facilitates converting Vulkan shaders to D3D12 shaders?
@skyan - thanks, its latest version installs automatically on the Surface with Snapdragon X. It implements a Microsoft Vulkan driver, which appears alongside the Qualcomm Vulkan driver; both show in vulkaninfo. llama.cpp's Vulkan backend picks the native Qualcomm driver (probably derived from their Android work), which seems to implement more/better Vulkan features than Microsoft's translation driver on top of the (Qualcomm-provided) native DX12 driver. I will have a look at whether I can test-tweak the llama.cpp Vulkan backend to use the Microsoft translation driver, and report here if I manage it and it works.
@AndreasKunar To try it, you just have to manually pick the device by setting the environment variable.
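The device selection mentioned above can be sketched as follows. GGML_VK_VISIBLE_DEVICES is the variable named later in this thread; the index 1 is an assumption that the D3D12 translation driver is the second device reported by vulkaninfo:

```shell
# Select the second Vulkan device (index 1) for llama.cpp's Vulkan backend.
# Windows cmd:
#   set GGML_VK_VISIBLE_DEVICES=1
# PowerShell:
#   $env:GGML_VK_VISIBLE_DEVICES = "1"
# POSIX shell equivalent:
export GGML_VK_VISIBLE_DEVICES=1
echo "GGML_VK_VISIBLE_DEVICES=$GGML_VK_VISIBLE_DEVICES"
```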
@0cc4m and @sykuang - thanks a lot! Now with Microsoft's Vulkan-to-DX12 driver selected, the error changes from "llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown" to "llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorOutOfHostMemory". Trace with tinyllama-1.1b/ggml-model-f16.gguf below: ggml_vk_instance_init()
That means that the D3D12 translation layer doesn't provide enough shared memory for the big ( |
@0cc4m and @AndreasKunar, I wanted to let you know that thanks to @0cc4m's input, I can run phi-3 on the Snapdragon X platform. |
@sykuang - cool! What did you do exactly? Just reduce/edit the kernels (which ones?) and set GGML_VK_VISIBLE_DEVICES=1 (to the D3D12 driver)?
@AndreasKunar, I've adjusted the code in ggml_vk_load_shaders by commenting out certain sections, which has resolved the issue where DirectX was reporting out-of-memory errors. You can refer to sykuang@b18e648 |
Thanks, I tried to get it to run, but once I offloaded Phi3-4k layers onto the GPU, the results either got strange or llama-cli crashed. Vulkan performance on Snapdragon X Plus was also much worse than e.g. Q4_0_4_8 - Q4 vs. Q4_0_4_8 vs. Vulkan:
Vulkan0: Microsoft Direct3D12 (Snapdragon(R) X Plus - X1P64100 - Qualcomm(R) Adreno(TM) GPU) (Dozen) | uma: 1 | fp16: 1 | warp size: 64
build: 081fe43 (3441) Can you maybe show your performance results (llama-bench -p 512 -n 128)? Currently it looks to me like Q4_0_4_8 quantization using the CPU's MATMUL is much faster than Vulkan.
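For reference, the benchmark requested above would be run with an invocation along these lines (the model path is a placeholder, not taken from this thread):

```shell
# Placeholder model path; -p/-n match the request above
# (512-token prompt processing, 128-token generation).
./llama-bench -m ./models/model-Q4_0.gguf -p 512 -n 128
```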
Thanks a lot, mea culpa, I did not know this. However, even the "reduced kernels" with the Vulkan backend and the Vulkan-to-DirectX12 driver, as well as the Phi3 Q4 models, run out of memory on load for me. Q4_0_4_8 did at least run llama-bench, and sorry, I never verified the llama-cli results (which, I now see, are garbage). What I'm trying to find out is whether Vulkan/GPU on Snapdragon X for Q4 is (or can be) faster than the Q4_0_4_8-optimized kernels on the CPU.
@AndreasKunar You made it further than I did when I last attempted this. I didn't even get as far as building the native ARM64 Vulkan loader. I had to rely on that Vulkan wrapper provided by Microsoft, and yes, full of garbage and really slow :( Hit me up with anything you want help testing. I'm going to try building the Vulkan-Loader vulkan-1.lib locally. Was it straightforward?
Thanks!!! My problem is that the native Qualcomm Vulkan driver does not load (according to 0cc4m, a bug similar to the one on Android). And the Microsoft Vulkan-to-DirectX12 translation driver runs out of memory and also seems very slow. So currently I have given up, because the Q4_0_4_8 acceleration for the Snapdragon X CPU is now nearly as fast as my M2 Mac's 10-core GPU (which in theory should be faster than the Snapdragon's GPU). I am waiting to see what the work on QNN (Qualcomm NPU) in PR#6869 achieves - probably not speed, but less power consumption. My Surface Pro 11 tends to overheat+throttle even with its Snapdragon X Plus when running all cores at full load with llama.cpp.
Totally straightforward. I suggest using the llama.cpp build instructions for WoA (my PR with the description just got merged) to set up VS2022+tools, then git-clone Vulkan-Loader and build it.
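A minimal sketch of that loader build, assuming the standard CMake flow from the Vulkan-Loader README (the UPDATE_DEPS option fetches the loader's build dependencies; exact output paths may differ):

```shell
# Hypothetical build sketch for a native ARM64 vulkan-1.lib on Windows,
# run from a VS2022 ARM64 developer prompt with CMake and git available.
git clone https://github.com/KhronosGroup/Vulkan-Loader.git
cd Vulkan-Loader
cmake -S . -B build -D UPDATE_DEPS=ON
cmake --build build --config Release
# vulkan-1.lib should end up under build\loader\Release\
```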
Current status - I can't get llama.cpp/Vulkan to run under Windows on ARM with Snapdragon X (Surface Pro 11 base model) and give up for now. When debugging it, there always is an internal exception thrown after the call: With the Snapdragon X's Adreno probably not faster than the Q4_0_4_8 CPU acceleration anyway, I'm giving up on llama.cpp/Vulkan on the Snapdragon X and closing this issue. Q4_0_4_8 on the Snapdragon X CPU has approximately the same performance as Q4_0 on my 10-GPU M2. I'm shifting to try and work with the ollama team to get ollama to run on WoA with the Snapdragon X and support Q4_0_4_8.
What happened?
The new Copilot+PCs with Qualcomm Snapdragon X processors (in my case a Surface Pro 11 with Snapdragon X Plus and 16GB RAM) are fast and run llama.cpp on the CPU without issues. They also include a Vulkan driver and run the Vulkan samples without problems. But llama.cpp built with Vulkan now finally builds, yet does not run.
llama-cli is terminating on model-load with:
llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorUnknown
llama_load_model_from_file: failed to load model
main: error: unable to load model
Name and Version
llama-cli version: 3378 (71c1121) with a quick-fix to compile (see #8446), built with MSVC 19.40.33812.0 for ARM64
built with:
Installed the VulkanSDK for Windows x64, then built a Windows ARM64 version of the KhronosGroup/Vulkan-Loader vulkan-1.lib (and tested its functionality with the loader's tests and samples) and copied it into the VulkanSDK lib directory for building llama.cpp.
What operating system are you seeing the problem on?
Windows
Relevant log output
console output.txt
main.log
vulkaninfo.txt