
Feature : Add SYCL runtime support #747

Open
wants to merge 69 commits into base branch sycl-refactor

Conversation

abhilash1910
Contributor

As part of the plan to support an extensive ecosystem spanning Hugging Face and the LoRA/LLaMA model families, this addition would enable this great framework to run seamlessly on our GPU cards. The discussion started from this merged PR (artidoro/qlora#219), and this PR would eventually enable low-precision quantization. (The PR is in an initial draft state.)
Tagging @TimDettmers . Thanks for creating this repository.

@abhilash1910
Contributor Author

@Titus-von-Koeller based on issue #894, there are some abstractions specific to certain Intel devices which are to be selectively implemented in IPEX, but for the majority of ops/kernels to be supported across all SYCL devices, this integration is needed. This is a WIP, as it requires some dependencies to be resolved on Intel LLVM to make it runnable across SYCL devices. In general, this is a broader approach catering to the SYCL runtime. @jianan-gu

@abhilash1910 changed the title from "Feature : Add Intel GPU/SYCL runtime support" to "Feature : Add SYCL runtime support" on Dec 13, 2023
@Titus-von-Koeller
Collaborator

Dear @abhilash1910,

Thanks for submitting this PR and flagging the connection with #894, really appreciated.

Would you be so kind as to give a bit more background and describe in detail what your PR is aiming to solve (once it's complete) and what roadblocks you see before getting there?

Be aware that, contrary to Tim, my systems programming knowledge is not that strong yet, but once I understand the above better, I'm happy to coordinate and help along this contribution.

Could you please elaborate specifically where you see overlap + hurdles to resolve relative to #894?

@abhilash1910
Contributor Author

Hi @Titus-von-Koeller
Thank you.
Yes, sure: the main aim of this PR is to provide a common runtime for SYCL kernels, i.e., device-agnostic kernels which can be made to run on the SYCL runtime. I had started this initially for our Intel GPU devices, which would now be partly covered by the IPEX module (specified in #894). For users and developers who are not using the IPEX runtime entirely, and who want bare-metal performance on (nearly all) devices supporting the SYCL runtime, this PR is needed.
Although the IPEX pathway would provide some intrinsic optimizations through IPEX kernels (and some specific ISAs which will benefit certain sets of our GPU lineups), this is a more generic standalone implementation which is decoupled from IPEX and can be compiled and run separately. The entirety of this port relies on Intel LLVM's clang, which is also being modified to provide comparable bare-metal APIs. SYCL is a runtime built on top of OpenCL (the de facto runtime supported on the majority of CPUs) and can be aligned as a middleware layer compatible with Nvidia/AMD and other devices supporting the SYCL runtime (from Khronos).
The initial motivation was to provide support for our devices specifically (GPU sets), but with the growing adoption of oneAPI and SYCL on all device backends, this would provide another option for users who want to rely on pure compiler runtimes for performance.
As of yet, there are not many compatibility hurdles compared to my colleague's PR (#894), as much of the IPEX work would be on the IPEX framework (including quant/dequant). But when it comes to certain functionality issues, our compiler/SYCL backend and LLVM are now being updated with more robust APIs for translation and analysis, which is why the PR is in draft mode. I am working on the compiler support as well, and since most of the translations are in place, only a small part is left, which will be done soon.
Let me know your thoughts; I appreciate your help and responses. :)

@TimDettmers
Collaborator

This is great, thank you so much for your contribution!

Can you run me through how a user would use/compile this for their Intel device? A step-by-step procedure would be great to give users an easy overview.

@TimDettmers added the "medium priority (will be worked on after all high priority issues)" and "Medium risk (risk of bugs in transformers and other libraries)" labels on Jan 2, 2024
@abhilash1910
Contributor Author

Hi @TimDettmers, @Titus-von-Koeller: as my fellow colleagues are working on the common device abstraction properties, this is an alternative addition to support the SYCL backend across vendors. Since we are only abstracting the SYCL kernels here, there is a need for a dedicated CMake setup for building only the kernels. Compiling this would require a standalone CMake configuration with the following options:

  • Intel Base Toolkit (latest oneAPI release)
  • IntelSYCL and IntelLLVM (present within the Base Toolkit)
  • The C++17 standard, as SYCL has a deep linkage to it
  • The icpx and icx compilers
  • -fsycl targets for SPIR-V (Intel), PTX (Nvidia), and amdgcn (AMD)
  • Certain compiler-level optimizations (narrowing/lowering etc.); see the compile sketch below.

This work will be a parallel, separate effort from my colleagues' existing effort, to meet the requirements of customers who are interested in bare SYCL performance, and gradually more SYCL organizations would be contributing kernels to this PR. I will be sending out a detailed description of our SYCL backend support. Also, to get a better idea regarding this, @Titus-von-Koeller @TimDettmers, would you be available to discuss the strategies via a meeting: firstly the device-common strategy which my colleagues are working on, and especially the SYCL kernel path which is the current PR?
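To make the build options above concrete, here is a minimal, illustrative SYCL kernel and the kind of compile line they imply. This is a hedged sketch, not code from this PR: the file name is hypothetical, and the Nvidia/AMD targets assume the Codeplay oneAPI plugins are installed (AMD additionally needs an --offload-arch flag for a specific GPU).

```cpp
// demo.cpp: hypothetical example, not part of this PR.
// Possible compile line (assumes oneAPI icpx; NV/AMD targets need Codeplay plugins):
//   icpx -std=c++17 -fsycl \
//        -fsycl-targets=spir64,nvptx64-nvidia-cuda,amdgcn-amd-amdhsa \
//        demo.cpp -o demo
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    sycl::queue q;  // selects the default SYCL device at runtime
    std::vector<float> data(16, 1.0f);
    {
        sycl::buffer<float> buf(data.data(), sycl::range<1>(data.size()));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            // Trivial elementwise kernel, runnable on any SYCL backend.
            h.parallel_for(sycl::range<1>(data.size()),
                           [=](sycl::id<1> i) { acc[i] *= 2.0f; });
        });
    }  // buffer destruction synchronizes and copies results back
    std::cout << "data[0] = " << data[0] << std::endl;
    return 0;
}
```

The same source compiles to a fat binary carrying SPIR-V, PTX, or amdgcn device code depending on the -fsycl-targets list, which is what makes a single kernel codebase portable across vendors.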

@matthewdouglas
Member

@abhilash1910 I have a few questions.

  1. What would be the advantage of using the PTX (Nvidia) target on the SYCL implementation vs. existing CUDA backend? I can understand the appeal to have a single abstraction across hardware vendors, but I just can't see shifting away from the CUDA reference implementation. Further, as far as I understand, there's no way to target Apple Silicon.
  2. Does this require the plugin from Codeplay to target AMD when using the Intel compiler? I notice that the Codeplay AMD plugin is still labeled as being "beta."
  3. Is it possible to build with AdaptiveCpp (aka hipSYCL / Open SYCL)?

@abhilash1910
Contributor Author

> @abhilash1910 I have a few questions.
>
>   1. What would be the advantage of using the PTX (Nvidia) target on the SYCL implementation vs. existing CUDA backend? I can understand the appeal to have a single abstraction across hardware vendors, but I just can't see shifting away from the CUDA reference implementation. Further, as far as I understand, there's no way to target Apple Silicon.
>   2. Does this require the plugin from Codeplay to target AMD when using the Intel compiler? I notice that the Codeplay AMD plugin is still labeled as being "beta."
>   3. Is it possible to build with AdaptiveCpp (aka hipSYCL / Open SYCL)?

@matthewdouglas Thanks for the questions. I will try to answer them in the best way possible:

  1. Currently we are ramping up SYCL in terms of performance on different clang compilation targets, and the intent is to benchmark kernel performance with respect to CUDA. It is not only for a common abstraction at the compiler level; the findings here allow us to optimize LLVM to a great extent and in turn percolate optimizations to SYCL and the "xpu" logic as a whole (since xpu is built on top of the SYCL abstraction). Regarding adoption, there is an overlap regarding running bare kernels (SYCL) on different Intel GPU variations (which will probably come up for bitsandbytes once the Python-side development in the other PR is completed) and customer demands. In terms of community, we have been getting a lot of requests; for CUDA SYCL it is for internal and external benchmarking purposes. As of yet, SYCL does not target Apple Silicon.
  2. I am working with Codeplay on the internal adoption to support amdgcn, and currently the PR should help optimize the backend for AMD SYCL as well. We have been targeting SYCL across GPU vendors to see the stability of the common kernel code.
  3. AdaptiveCpp is something which we are discussing internally for AMD specifically, but once we have established a SYCL interface, we will go ahead with the discussion & strategy.

Also, on another note, I would like to add that SYCL support specifically for kernels would eventually be one of the paths for the ongoing Intel GPU upstream effort. And having a SYCL backend opens up not only Intel but also other devices for competitive benchmarking.

@abhilash1910
Contributor Author

abhilash1910 commented Mar 30, 2024

Seeing the development update in #898 on a separate branch, I do agree that branch separation of backends will be helpful. While the common device abstraction is the probable goal to merge, I think this PR would benefit from a standalone SYCL branch. Since we plan to integrate full SYCL support across all kernel adaptations, this would be helpful to maintain as a separate branch. Hence I'm proposing a separate branch: SYCL.
Would appreciate your thoughts, @TimDettmers @Titus-von-Koeller @matthewdouglas .
cc @jgong5 @yao-matrix
PS: The PR would still take some amount of time to complete and would be eventually assisted by Codeplay.

@abhilash1910
Contributor Author

Yes, I think these are the prospective answers:

  1. This is something which really requires a lot of collaborative thought. SYCL, being a language that supports hardware semantics, can be leveraged (through different device-specific CMake args) to build for separate runtimes & devices. When the torch extensions arising from these separate CMake builds are used to build the .so files, we can plug into the torch dispatcher as we normally would in any application. The only catch here is that there would be separate binaries for separate devices, which need wrapping with torch. I would also ping @JackAKirk @npmiller @Ruyk @AerialMantis for their thoughts on this.

  2. Yes, this is something which is very interesting in terms of performance: specifically, for GEMM kernels, how large the perf gap is compared to the pure cuBLAS implementation. It would be the SYCL runtime compiled for PTX targets vs. native CUDA runtime performance. This would give a good idea of how to further improve SYCL standards in general.

  3. SYCL, being a language in its entirety, does need extensions to be pluggable with torch. The extensions would come after we compile the binaries (from SYCL) and link the device-specific binaries with torch extensions. For instance, we have DPCPPExtension for SYCL for SPIR-V (Intel targets) and the standard CUDAExtension for CUDA (or, in some cases, HIP); see the sketch after this list. The latter case requires the binaries to be compatible with PTX/CUDA or amdgcn targets.

  4. There are a couple of considerations on this, since we have a standing approachable PR to torch for Intel GPU support (which contains SYCL kernels). It is actually a separate, non-interfering branch which requires some thought. (I am keeping this open for now until I think of the best way, if intersection is possible.)
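As a rough illustration of the extension split described in point 3, a build script could select between the two paths. This is a hedged sketch under assumptions: the package/extension names and source files are hypothetical, DPCPPExtension/DpcppBuildExtension follow IPEX's C++ extension API, and (as noted later in this thread) bitsandbytes does not actually build as a PyTorch extension.

```python
# setup.py sketch: hypothetical names, not this PR's actual build.
from setuptools import setup

USE_XPU = True  # hypothetical switch: SPIR-V (Intel) build vs. CUDA build

if USE_XPU:
    # DPCPPExtension path for SYCL/SPIR-V targets (IPEX's extension API).
    from intel_extension_for_pytorch.xpu.cpp_extension import (
        DPCPPExtension, DpcppBuildExtension)
    ext = DPCPPExtension("bnb_sycl", ["csrc/sycl_kernels.cpp"])
    build_ext = DpcppBuildExtension
else:
    # Standard CUDAExtension path for PTX/CUDA (or HIP) targets.
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension
    ext = CUDAExtension("bnb_cuda", ["csrc/cuda_kernels.cu"])
    build_ext = BuildExtension

setup(
    name="bnb-native",  # hypothetical package name
    ext_modules=[ext],
    cmdclass={"build_ext": build_ext},
)
```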

@abhilash1910
Contributor Author

@Titus-von-Koeller A detail on this PR: I am nearing completion of the kernels. Since the LOC is quite big for this PR, I think the PyTorch aspect of linking the binaries can be extended in a separate PR. I am open to suggestions and could add it here as well, but it would make the review process very hectic.
Currently the kernels are almost finalised (barring some small changes); it would be great if the review process could also be started.
I have a separate collection of kernel-wise test scripts (in native C++, without invoking the PT backend) which I can add as a benchmark test suite for SYCL. I'd appreciate thoughts from other reviewers as well. Thanks

@abhilash1910 abhilash1910 marked this pull request as ready for review June 7, 2024 13:44
@matthewdouglas
Member

One point to note is that bitsandbytes isn't built as a PyTorch extension currently for any of the backends. Instead the shared lib exports a C API that is invoked with Python's ctypes module.

This SYCL backend also does not need to link to libtorch or even python, and since it exposes the same API (see pythonInterface.cpp), it should be possible to just build multiple targets (e.g. spirv, ptx, amdgcn) and load the correct .so at runtime; no CUDAExtension or DPCPPExtension needed.
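A minimal sketch of this runtime-selection idea, assuming hypothetical library file names (the real bitsandbytes loader logic differs in its details):

```python
# Illustrative only: pick a prebuilt shared object per SYCL target and load
# it with ctypes, mirroring how bitsandbytes invokes its exported C API.
import ctypes
import platform
from pathlib import Path

def load_native_lib(target: str) -> ctypes.CDLL:
    """Load the shared library built for the given device target
    (e.g. "spirv", "ptx", or "amdgcn"); names here are hypothetical."""
    suffix = ".dll" if platform.system() == "Windows" else ".so"
    lib_path = Path(__file__).parent / f"libbitsandbytes_sycl_{target}{suffix}"
    return ctypes.cdll.LoadLibrary(str(lib_path))

# Usage: select the binary matching the detected device, then call the
# exported C functions directly; no libtorch or Python linkage required.
# lib = load_native_lib("spirv")
# lib.cquantize_blockwise_fp32  # hypothetical exported symbol
```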

@abhilash1910
Contributor Author

> One point to note is that bitsandbytes isn't built as a PyTorch extension currently for any of the backends. Instead the shared lib exports a C API that is invoked with Python's ctypes module.
>
> This SYCL backend also does not need to link to libtorch or even python, and since it exposes the same API (see pythonInterface.cpp), it should be possible to just build multiple targets (e.g. spirv, ptx, amdgcn) and load the correct .so at runtime; no CUDAExtension or DPCPPExtension needed.

Yes, this is something I had noticed, and in that case the kernels can be built for multiple targets.
At runtime the kernels can be offloaded without linking to the respective PyTorch bindings for the devices.
Thanks for the details @matthewdouglas

@abhilash1910
Contributor Author

abhilash1910 commented Jul 29, 2024

Hi @Titus-von-Koeller, pinging for a review. The code is now updated with the latest compiler. After some internal discussion, this PR will consist of the SYCL kernels only, so that these can be abstracted across vendors. Since the kernels are completed, I would request that the review process be started; there is also a test suite which has been created for SYCL, and I wanted to discuss how to add that. The Nvidia build feature is also added. There is also a separate repository containing iGPU/CUDA builds of this fork with Python bindings: https://github.com/abhilash1910/bitsandbytes_sycl_samples

FYI: there is another PR from my colleague with xpu support, #1257, which is using IPEX for xpu (Intel GPU) support only. We are discussing better ways to formulate it on the multi-backend refactor branch.

@abhilash1910
Contributor Author

abhilash1910 commented Aug 7, 2024

Pinging @Titus-von-Koeller @matthewdouglas with the latest updated code.
cc @JackAKirk, @npmiller
(@zhuhong61 @Liangliang-Ma for awareness)

@Titus-von-Koeller
Collaborator

Titus-von-Koeller commented Aug 8, 2024

Dear @abhilash1910,

Thanks for this valuable contribution. I see you've made major progress, impressive!

We (the bitsandbytes team) are under very tight bandwidth at the moment, so we won't have time to deep dive into it just yet. Currently we're both trying to get out of a maintenance impasse and preparing for the Intel CPU+XPU and AMD ROCm alpha release. Additionally, we're working on infra, packaging, and testing improvements. Also, the multi-backend stuff still needs a major refactor, with everyone waiting for us to finalize the backend abstraction with a move to the torch.library API for registering custom ops and dispatching through the PyTorch dispatcher.
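For context, a minimal sketch of the torch.library registration pattern referred to above; the op name and schema here are hypothetical, not the planned bitsandbytes schema:

```python
import torch

# Define a custom op namespace and schema, then register a per-device
# implementation; the PyTorch dispatcher routes calls by tensor device.
lib = torch.library.Library("bnb", "DEF")  # "bnb" namespace is hypothetical
lib.define("dequantize_demo(Tensor a) -> Tensor")

def dequantize_demo_cpu(a: torch.Tensor) -> torch.Tensor:
    # Placeholder CPU implementation for illustration.
    return a.float()

lib.impl("dequantize_demo", dequantize_demo_cpu, "CPU")
# A SYCL/XPU build would register its kernel under the "XPU" dispatch key,
# and the same torch.ops.bnb.dequantize_demo call would reach it.
out = torch.ops.bnb.dequantize_demo(torch.ones(2))
```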

These things currently take priority.

We'll however try to take a look as soon as we can, and with @matthewdouglas having just been added to the team full-time, we're in a better position going forward.

@Titus-von-Koeller
Collaborator

> Also there is a test suite which has been created for SYCL and wanted to discuss how to add that

@abhilash1910 are these the tests you were referring to?

@abhilash1910
Contributor Author

abhilash1910 commented Aug 8, 2024

Hi @Titus-von-Koeller
Thanks for the details. Yes, it is in fact just the public version of it; there is also an internal testing suite. I plan to add that to my personal repository until we see how to add it to sycl-refactor.
For the XPU logic addition, I believe some of these kernels would eventually get used for #1257. I am aware of the major refactoring going on for the multi-backend branch, and for XPU specifically we are currently prototyping the full integration.
Some context on this work: we are currently examining kernel-to-kernel performance, primarily on A100 / iGPU (datacenter), for public competitive benchmarking purposes. I believe that down the line this branch will evolve into a SYCL-runtime-only version of bitsandbytes for compiler-level performance & analysis. In the meantime, I would appreciate triggering the CI.
I appreciate the work you guys are doing, and I am also taking a look internally into the XPU upstreaming logic for the multi-backend refactor.
