[SYCL] Caching device_info in device_ext to restore TG performance #8301
Conversation
Probably related to #8286
Force-pushed from 3424fce to 8055548
Yes, indeed there has been some overlap there. Will see how we can sort this out.
I attempted to compile and try it on my integrated Intel GPU. It compiles, but it does not work on inference. I tried the command
Thanks for flagging that @mistrjirka, we haven't tested the recent Llama 3 on our SYCL backend yet. Is this an error you get on the master branch as well, or is it introduced by this patch specifically?
I think the solution of #8286 is more explicit.
Hi @airMeng |
`max_work_group_size` is an integer value. So, in PR #8286, I use a local cache (an `int` array) in the static class instance `ggml_sycl_info()`.
This is a proper solution rather than a workaround, because the static class instance returned by `ggml_sycl_info()` is const and all accesses are read-only. This PR instead uses more complex code with a mutex.
I am sorry, I was accidentally on the wrong branch. I have now tested it on the correct branch, and I can confirm that it works without problems. I do not observe any performance improvement, but I suppose that is because the A100 and Iris Xe are quite different GPUs.
Closing this as another fix tackling the same issue got merged first: #8286
This patch introduces a caching mechanism within `device_ext` to persist the `device_info`s within the `dev_mgr` singleton instance. This particularly lifts the runtime overhead introduced by the current `get_work_group_size` implementation, which triggers unnecessary device property queries only to grab the work-group size, namely for the norm and softmax operations. Performance of text generation specifically got recovered:
Nvidia A100 + 70B Q4_K : 2.0 t/s -> 10.5 t/s (~ x5)
Nvidia A100 + 13B Q4_K : 4.25 t/s -> 42.5 t/s (~ x10)
Pinging @joeatodd @airMeng @AidanBeltonS @abhilash1910