Vulkan Bugfixes and Improvements #7084

Merged: 9 commits into master on May 9, 2024
Conversation

@0cc4m (Collaborator) commented May 5, 2024

Apologies for the wait since the last update, I was rather busy.

Here are a number of bugfixes that should hopefully fix the incoherence that the Vulkan backend has shown for a while now.

I also modified the MMV shaders to run batches in a single call instead of multiple calls; this might improve performance on devices with a higher shader invocation overhead.

Finally, this PR also includes further work towards running MoE models with Vulkan, but the mul_mat_id code is not ready yet.
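
As an illustration of the MMV batching change, a minimal sketch of the difference at the Vulkan dispatch level follows; the function names and push-constant layout here are assumptions for illustration, not the backend's actual code.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Hypothetical push-constant block; the real MMV shaders use their own layout.
struct mmv_push_constants {
    uint32_t batch_idx;
};

// Old pattern (sketch): one vkCmdDispatch per batch element, each with its own
// push-constant update, which multiplies per-call overhead on some drivers.
static void dispatch_mmv_per_batch(VkCommandBuffer cmd, VkPipelineLayout layout,
                                   uint32_t groups_x, uint32_t ne12) {
    for (uint32_t b = 0; b < ne12; ++b) {
        mmv_push_constants pc = { b };
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(pc), &pc);
        vkCmdDispatch(cmd, groups_x, 1, 1);
    }
}

// New pattern (sketch): a single dispatch covers the whole batch by using the
// Z workgroup dimension as the batch index (gl_WorkGroupID.z in the shader).
static void dispatch_mmv_batched(VkCommandBuffer cmd, uint32_t groups_x, uint32_t ne12) {
    vkCmdDispatch(cmd, groups_x, 1, ne12);
}
```

The single dispatch avoids repeating the per-call setup, which is where the shader invocation overhead mentioned above comes from.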

@MaggotHATE (Contributor)

Thank you for the update! Unfortunately, I'm seeing gibberish on a 10.7B model at Q5_K_S (a frankenmerge, sure, but it works just fine on CPU and CLBlast) with a long initial prompt (~2700 tokens). Parameters are n_ctx 4096, n_batch 2048, u_batch 512. Setting n_batch to 4096 doesn't help. Another model, 7B at Q6_K, gives correct output on Vulkan with these settings.

Additionally, there's a very noticeable delay after detecting the Vulkan device on Win 10 (new system, still a 1060 3GB), which was hardly noticeable on Win 8.1. That, however, might or might not be caused by having a single graphics output device on my new system (the previous CPU had an iGPU, which wasn't used, though).

Finally, there's still a huge difference in memory consumption. It seems like the difference in VRAM is even larger now: on that 10.7B model, 9 layers with CLBlast occupy 1792 MB, while 7 layers with Vulkan occupy 2524 MB. Also, it uses ~300 MB of shared VRAM with any number of layers.

With no_kv_offload Vulkan now uses even more shared VRAM, which probably makes sense (previously it just used RAM - or maybe it's just Windows' quirks).

At the same time, the difference in speed between this and CLBlast is even bigger; Vulkan is really fast both in prompt processing and token generation.

@0cc4m (Collaborator Author) commented May 5, 2024

> Thank you for the update! Unfortunately, I'm seeing gibberish on a 10.7B model at Q5_K_S (a frankenmerge, sure, but it works just fine on CPU and CLBlast) with a long initial prompt (~2700 tokens). Parameters are n_ctx 4096, n_batch 2048, u_batch 512. Setting n_batch to 4096 doesn't help. Another model, 7B at Q6_K, gives correct output on Vulkan with these settings.

Can you give me a link to the model that's not working? If I can reproduce the source of the incoherence I can hopefully fix it.
Does it work without this PR?

> Additionally, there's a very noticeable delay after detecting the Vulkan device on Win 10 (new system, still a 1060 3GB), which was hardly noticeable on Win 8.1. That, however, might or might not be caused by having a single graphics output device on my new system (the previous CPU had an iGPU, which wasn't used, though).

This is most likely shader compilation happening. The GPU driver should cache the shaders, so it should only be slow once with each update and fast on subsequent launches.
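
For background, the slow part is most likely pipeline creation from the SPIR-V shaders, and the driver caches the compiled result so that later launches are fast. A related application-side mechanism, shown here purely as a hedged illustration (llama.cpp does not necessarily do this), is persisting a VkPipelineCache across runs:

```cpp
#include <vulkan/vulkan.h>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

// Load a previously saved pipeline-cache blob from disk (empty on a first run
// or after an update, in which case shader compilation happens normally).
static std::vector<char> load_cache_blob(const char * path) {
    std::ifstream f(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>());
}

// Create a VkPipelineCache seeded with the blob; pipelines created against it
// can reuse previously compiled shaders instead of recompiling them.
static VkPipelineCache create_pipeline_cache(VkDevice device, const std::vector<char> & blob) {
    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();
    info.pInitialData    = blob.empty() ? nullptr : blob.data();
    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;
}

// After building the compute pipelines, persist the cache for the next launch.
static void save_cache_blob(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), (std::streamsize) blob.size());
}
```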

> Finally, there's still a huge difference in memory consumption. It seems like the difference in VRAM is even larger now: on that 10.7B model, 9 layers with CLBlast occupy 1792 MB, while 7 layers with Vulkan occupy 2524 MB. Also, it uses ~300 MB of shared VRAM with any number of layers.

This shouldn't have changed compared to without this PR. It's expected that Vulkan uses more VRAM for layers since much more of the model is offloaded. The CLBlast backend basically only runs the matrix multiplication on the GPU and nothing else.

> With no_kv_offload Vulkan now uses even more shared VRAM, which probably makes sense (previously it just used RAM - or maybe it's just Windows' quirks).

Shared VRAM is most likely the staging buffers for copying data to and from the GPU. Disabling KV offload means that the KV cache resides in RAM (shared VRAM is RAM), so that's expected behavior.
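
As context for why this shows up as "shared VRAM": staging buffers live in host-visible memory, i.e. system RAM that the GPU can access, which Windows reports as shared GPU memory. A minimal sketch of how such a buffer is typically allocated with the Vulkan API follows; the helper names and the omitted error handling are assumptions, not llama.cpp's actual code.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Pick a memory type that is host-visible (backed by system RAM and mappable
// by the CPU); allocations from it are counted as shared GPU memory.
static uint32_t find_host_visible_type(VkPhysicalDevice phys, uint32_t type_bits) {
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);
    const VkMemoryPropertyFlags want =
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((type_bits & (1u << i)) && (props.memoryTypes[i].propertyFlags & want) == want) {
            return i;
        }
    }
    return 0; // fallback; a real implementation would handle this more carefully
}

// Allocate a staging buffer used to shuttle tensor data between RAM and VRAM.
static VkDeviceMemory create_staging_buffer(VkPhysicalDevice phys, VkDevice dev,
                                            VkDeviceSize size, VkBuffer * out_buf) {
    VkBufferCreateInfo buf_info = {};
    buf_info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    buf_info.size        = size;
    buf_info.usage       = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    buf_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
    vkCreateBuffer(dev, &buf_info, nullptr, out_buf);

    VkMemoryRequirements req;
    vkGetBufferMemoryRequirements(dev, *out_buf, &req);

    VkMemoryAllocateInfo alloc_info = {};
    alloc_info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc_info.allocationSize  = req.size;
    alloc_info.memoryTypeIndex = find_host_visible_type(phys, req.memoryTypeBits);
    VkDeviceMemory mem = VK_NULL_HANDLE;
    vkAllocateMemory(dev, &alloc_info, nullptr, &mem);
    vkBindBufferMemory(dev, *out_buf, mem, 0);
    return mem;
}
```

Because the memory type is host-visible, such buffers are accounted for as shared rather than dedicated VRAM.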

> At the same time, the difference in speed between this and CLBlast is even bigger; Vulkan is really fast both in prompt processing and token generation.

Thanks for testing! Did the speed improve for you compared to without this PR?

@daniandtheweb (Contributor)

For me, on a Radeon 5700 XT, the performance is almost the same as the main branch; it's just a little bit slower: 264 t/s on main vs. 260 t/s on the PR for prompt processing, and 33 t/s on main vs. 30 t/s on the PR for generation, on Llama 3 Q5_K_M.

@MaggotHATE (Contributor)

> Thanks for testing! Did the speed improve for you compared to without this PR?

Yes, but it's only noticeable at the start. The average is not so impressive for generation: for example, on a 7B Q6_K model it goes from 5.033 t/s to 5.064 t/s. Still, it was a 611-token result, so the usual slowdown diminishes the improvement.

Processing of that ~2700-token instruct: on 10.7B at Q5_K_S (10 layers offloaded) it was 35.895 t/s; with this PR it's 37.755 t/s.
On 7B at Q6_K (9 layers offloaded) it was 29.942 t/s; with this PR it's 30.833 t/s.

> so it should only be slow once with each update and fast on subsequent launches.

Ok, I see now: it happened all the time because I was alternating between the CLBlast and Vulkan versions. Also, maybe it's because it uses all available memory (even though it doesn't look like it on the graphs).

It seems to cache each version's shaders separately, because launching one doesn't speed up launching the other. Also, the mainline compiles faster, but it's not a big deal.

> Can you give me a link to the model that's not working?

https://huggingface.co/mradermacher/Fimbulvetr-10.7B-v1-i1-GGUF - I'm still testing message reloading in my program, and for some reason this model became a good benchmark for that. I'm not sure about the quality or the changes the imatrix brought.

> Does it work without this PR?

No, the same gibberish happened. Interestingly, while trying to test it, I struggled to even run the model with that large instruct. I had to increase the number of layers from 9 to 10 to make it work. It's like a sweet spot: not higher, not lower, exactly 10.

@0cc4m (Collaborator Author) commented May 5, 2024

> Can you give me a link to the model that's not working?

> https://huggingface.co/mradermacher/Fimbulvetr-10.7B-v1-i1-GGUF - I'm still testing message reloading in my program, and for some reason this model became a good benchmark for that. I'm not sure about the quality or the changes the imatrix brought.

> Does it work without this PR?

> No, the same gibberish happened. Interestingly, while trying to test it, I struggled to even run the model with that large instruct. I had to increase the number of layers from 9 to 10 to make it work. It's like a sweet spot: not higher, not lower, exactly 10.

I downloaded the Q5_K_S version of the model you linked and it runs fine for me across AMD and Nvidia GPUs. I'm not sure what's going on on your end. Which GPU are you using?

@MaggotHATE (Contributor)

> Which GPU are you using?

Same 1060 3GB, and the issue happens only with a large initial instruct. It works just fine with a typical Alpaca instruct or similar.

@MaggotHATE (Contributor) commented May 5, 2024

Update: gibberish just happened on a ~1100-token prompt. I wanted to try setting n_ubatch to 2048, but that's too much memory for my setup (16 GB RAM). Same on mainline and this PR.
UPD: in case it helps, the ~2700-token instruct is this, modified for the model format and with a text included in it.

@teleprint-me (Contributor) commented May 7, 2024

I have a question, probably not related, but @0cc4m is the only one who can really answer it. When I use train-text-from-scratch, I see a noticeable improvement compared to CPU, but I do not see any GPU usage when I train. Is there any reason why this is the case?

Also, as an aside, the initial implementations allowed me to offload most of the layers to the GPU without any hiccups, but now I'll crash if I allocate too many layers to the GPU. This has led me to switch between the CPU and GPU for different tasks. Does this have anything to do with your previous PR where you modified how the layers were handled?

I have narrowed down a general bug to the GLFW backend that is unrelated to llama.cpp, so I'm not sure if it's related or not. Still haven't pinned it down yet. This shouldn't be an issue in the near future because I plan on replacing my RX 580 with either an RTX 4060 Ti or a 7900 XT, haven't decided yet.

Regardless, just curious if you're able/willing to provide any insights?

@netrunnereve (Collaborator)

I did some quick tests with my W8100 and didn't really see any improvements or regressions. Honestly, after getting my CPU server I've been using Vulkan less and less, since my GPU is really only good for 7B models, and Command R 30B and Llama 70B completely blow away the small ones.

PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp 512 | 95.55 ± 0.30 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg 128 | 11.26 ± 0.03 |
| llama 8B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | pp 512 | 71.30 ± 0.29 |
| llama 8B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | tg 128 | 6.03 ± 0.08 |

Master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp 512 | 93.49 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg 128 | 11.75 ± 0.06 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | pp 512 | 69.26 ± 0.33 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | tg 128 | 6.09 ± 0.05 |

@0cc4m (Collaborator Author) commented May 8, 2024

> Update: gibberish just happened on a ~1100-token prompt. I wanted to try setting n_ubatch to 2048, but that's too much memory for my setup (16 GB RAM). Same on mainline and this PR. UPD: in case it helps, the ~2700-token instruct is this, modified for the model format and with a text included in it.

I can't seem to reproduce that issue. But n_batch > 512 is definitely broken; I'll take a look at what's going on there.

@0cc4m (Collaborator Author) commented May 8, 2024

> I have a question, probably not related, but @0cc4m is the only one who can really answer it. When I use train-text-from-scratch, I see a noticeable improvement compared to CPU, but I do not see any GPU usage when I train. Is there any reason why this is the case?

To be honest, I have no idea what train-text-from-scratch does, and I'd be surprised if it can use Vulkan (assuming that's what you meant).

> Also, as an aside, the initial implementations allowed me to offload most of the layers to the GPU without any hiccups, but now I'll crash if I allocate too many layers to the GPU. This has led me to switch between the CPU and GPU for different tasks. Does this have anything to do with your previous PR where you modified how the layers were handled?

Do you mean for running a model with main? VRAM use might have changed in later versions. Can you give me more details on what worked, what didn't/doesn't and on what hardware?

> I have narrowed down a general bug to the GLFW backend that is unrelated to llama.cpp, so I'm not sure if it's related or not. Still haven't pinned it down yet. This shouldn't be an issue in the near future because I plan on replacing my RX 580 with either an RTX 4060 Ti or a 7900 XT, haven't decided yet.

> Regardless, just curious if you're able/willing to provide any insights?

What GLFW backend?

@teleprint-me (Contributor) commented May 8, 2024

@0cc4m

> To be honest, I have no idea what train-text-from-scratch does, and I'd be surprised if it can use Vulkan (assuming that's what you meant).

That is what I meant. There is a definite 3x speed-up: a 3-hour training session is dramatically reduced to a ~40-50 minute one, every time. I didn't think it would work, but tried it out to see.

I use `make` for the CPU build and `make LLAMA_VULKAN=1` for the GPU build.

Any insights into why this might be the case @ggerganov? I don't know/understand enough about the implementation details.

> Do you mean for running a model with main? VRAM use might have changed in later versions. Can you give me more details on what worked, what didn't/doesn't and on what hardware?

I've been avoiding using GPU as much as I can lately because it keeps crashing my entire system.

I would use `-ngl n`, where n is the number of layers in the model. I would use Mistral 7B v0.2 and offload 32 layers, and it would work fine up until a certain point where it would just obviously slow down. The GPU would be maxed out by that point. Usually 16 layers was okay, even with my little 8 GB of VRAM.

I have plenty of CPU RAM, but it's not ideal for back prop.

> What GLFW backend?

I feel this is out of scope, but it does affect the AMDGPU DRM for the RX 5xx series, and the Vulkan backend has crashed in a similar fashion while using llama.cpp, which is part of the reason it's been difficult to trace and isolate.

I'll have to do some thorough tests when I'm not so deep into my work; too many projects open at once at the moment to risk it. GLFW#2493. There are a few other identified bugs related to the RX 5xx series with the Mesa graphics drivers.

They wanted me to test an AUR package, but I haven't had time yet.

@teleprint-me (Contributor)

I tried reproducing the results since I had to do an upgrade, but I can't with the latest commit. I'm so scattered at the moment, I think I might have lost track. The Vulkan backend does not seem to affect train-text-from-scratch. I must've been mistaken. Sorry for wasting any time. I'll post if I figure it out.

@teleprint-me (Contributor)

I just wanted to add some relevant input for this branch. I've been experimenting with it and it cleared up a few issues.

  • No random crashing. Some flickering. This is most likely due to poor driver support at this point. The code seems more stable overall though.
  • No more "gibberish" is produced by my models; they are outputting quality content on the GPU.
  • I've been running inference and have noticed improved runtime.
  • I'm able to offload layers as needed.

I'm sure this is a mixture of things, but these differences are extremely noticeable when compared to the master branch. My 30M param model just zips with it as well.

I think you're right about train from scratch not supporting this. I'll have to dig into this some more. Would be nice if it did work.

@0cc4m (Collaborator Author) commented May 9, 2024

@ggerganov @slaren If I run inference on Mistral 7B Q4_K_S with batch size 2048 and context size 2048 (and the default ubatch size, which should be 512), I get `GET_ROWS` calls with `src[1]->ne[0] == 0`, i.e. an empty src1. That causes validation issues in Vulkan, because it leads to dispatches with ranges of 0. I can try to catch those, but it propagates through the graph.

The call I used to run into this is `build/bin/main -f input_long.txt -c 2048 -n 512 --ignore-eos -m models/airoboros-m-7b-3.1.2.Q4_K_S.gguf -ngl 1000`. `input_long.txt` is a file with a 1737-token prompt.

Is that correct behavior that I need to work around, or is that a bug in the model code?

@slaren (Collaborator) commented May 9, 2024

That's normal behavior since #6122; the backend should skip zero-size tensors. The Vulkan backend was also updated in that PR, but maybe there are other cases.

@0cc4m (Collaborator Author) commented May 9, 2024

> That's normal behavior since #6122; the backend should skip zero-size tensors. The Vulkan backend was also updated in that PR, but maybe there are other cases.

Alright, thank you. I'll make sure they are skipped properly. After that fix this PR should be ready to merge.
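
For reference, the fix amounts to a guard of roughly this shape in a backend's graph-compute loop; this is a hedged sketch, and the function names are illustrative rather than the actual ggml-vulkan code.

```cpp
#include "ggml.h"

// A tensor with any zero-sized dimension represents no work; dispatching a
// Vulkan shader for it would produce a group count of 0 and trip validation.
static bool tensor_is_empty(const struct ggml_tensor * t) {
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        if (t->ne[i] == 0) {
            return true;
        }
    }
    return false;
}

// Sketch of a graph-compute loop that skips such nodes, e.g. a GET_ROWS whose
// src[1] has ne[0] == 0.
static void compute_graph_sketch(struct ggml_cgraph * graph) {
    for (int i = 0; i < graph->n_nodes; ++i) {
        struct ggml_tensor * node = graph->nodes[i];
        if (tensor_is_empty(node) ||
            (node->src[1] != nullptr && tensor_is_empty(node->src[1]))) {
            continue; // nothing to compute for this node
        }
        // ... record and dispatch the Vulkan compute shader for this node ...
    }
}
```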

@mofosyne added the labels bugfix (fixes an issue or bug) and Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs) on May 9, 2024
@0cc4m (Collaborator Author) commented May 9, 2024

I found and fixed the issue. I'll wait for the CI to finish and then merge this PR.

@MaggotHATE (Contributor)

Gibberish is fixed! Thank you for the update, @0cc4m!

@0cc4m merged commit befddd0 into master on May 9, 2024; 60 checks passed.
@0cc4m deleted the 0cc4m/vulkan-improvements branch on May 9, 2024 at 18:39.