[RFC] Extend bitsandbytes to support Intel hardware platforms #894

Open · 1 of 3 tasks

jianan-gu opened this issue Dec 1, 2023 · 10 comments

Motivation

The current bitsandbytes library is bound to CUDA platforms. However, there is rapidly growing demand to run large language models (LLMs) on more platforms, such as Intel® CPU and GPU devices ("xpu" is the PyTorch device tag for Intel GPUs). We therefore aim to extend Intel® CPU and GPU ecosystem support and optimizations to bitsandbytes, offering the same scope of low-precision computation features (8-bit and 4-bit) as on CUDA.

Approach

To provide the 8-bit and 4-bit features on Intel platforms, we propose two major changes:

  1. A device abstraction that allows non-CUDA devices to be added to bitsandbytes easily. It contains a device backend abstraction defining the key kernel interfaces each backend must implement, a backend registration interface for adding new device backends, and a kernel dispatching mechanism (a minimal sketch follows this list).
  2. Lightweight enabling of Intel CPU and GPU support on top of the device abstraction. We plan to leverage the PyTorch 2.x compiler stack and the custom kernels provided by Intel® Extension for PyTorch (IPEX) to support Intel CPU and GPU without the need to upstream native backend code. This reduces the complexity of adding new devices to bitsandbytes and also reduces maintenance costs.
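
As a rough illustration of change (1), the device abstraction could look like the following minimal sketch (the names Backend, register_backend, and get_backend are hypothetical, not a final API):

class Backend:
    # Abstract device backend: each device implements the key kernels.
    def igemmlt(self, A, B, *args, **kwargs):
        raise NotImplementedError
    def double_quant(self, A, *args, **kwargs):
        raise NotImplementedError

_backends = {}

def register_backend(device_type, backend):
    # e.g. register_backend("cpu", CPUBackend()); register_backend("xpu", XPUBackend())
    _backends[device_type] = backend

def get_backend(tensor):
    # Dispatch on the tensor's device type ("cuda", "cpu", "xpu", ...)
    return _backends[tensor.device.type]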

Device abstraction

We will extend the CUDA-only device setup and initialization in bitsandbytes to Intel CPU/GPU, and provide common device abstractions for general devices (there will be no changes on the CUDA side).
Note that there is also no API or usage change for Hugging Face users when running bitsandbytes on different devices:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

Lightweight integration for Intel CPU/GPU

We will do a lightweight and simple integration to enable the low-precision computation features, both 8-bit and 4-bit. We don't plan to add native backend code for Intel CPU and GPU in the first step. Instead, we will employ PyTorch 2.x compilation and Intel® Extension for PyTorch to enable those features.

  • For performance-critical functions, such as GEMM, we will import IPEX as a Python module and use its API for computation. IPEX can provide the best performance for such functions across Intel devices (for example, by using the Intel® Advanced Matrix Extensions instruction set on 4th generation Intel® Xeon® Scalable Processors).
    • IPEX, as an optional acceleration component, has already been integrated into mainstream Hugging Face tooling, such as the Trainer class and the accelerate library, to speed up training and inference. As Hugging Face does, we will integrate IPEX with bitsandbytes as a Python library dependency.

For example:

import torch
import intel_extension_for_pytorch  # registers the torch.ops.torch_ipex operators

def igemmlt_cpu_xpu(A_i8, B_i8, *args, **kwargs):
    # ... preparation before computation ...
    C_i32 = torch.ops.torch_ipex.matmul_i8i8i32(A_i8, B_i8)  # int8 GEMM computation via IPEX
    # ... other post-processing ...
    return C_i32
  • For other functions, we adopt the PyTorch 2.x compilation technology. We will implement them with basic PyTorch operators in Python and optimize them using torch.compile to get good performance. Intel is one of the major contributors to the torch.compile CPU backend in PyTorch and also hosts the torch.compile GPU backend in IPEX. The implementation can also work on other devices that support the PyTorch 2.x compiler stack.

For example:

import torch

@torch.compile
def double_quant_cpu_xpu(*args, **kwargs):
    # Implement double_quant for Intel CPU/GPU with plain PyTorch ops;
    # torch.compile will generate kernel code and compile it at runtime.
    ...

Design

(1) Reorganize device_setup to support multiple devices

Intel CPU or GPU

  1. is_ipex_available
  2. Import IPEX ops (and also check Intel GPU device availability); a sketch of these checks follows below

CUDA

  1. Remains the same: load from lib_cuda.so
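
A minimal sketch of what these availability checks might look like (the helper names are hypothetical):

import importlib.util
import torch

def is_ipex_available():
    # Check whether Intel Extension for PyTorch is installed.
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None

def is_xpu_available():
    # Check whether an Intel GPU ("xpu") device is usable.
    return is_ipex_available() and hasattr(torch, "xpu") and torch.xpu.is_available()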

[figure 1]

(2) Device backend abstraction with key kernel interfaces

Key functions used in the mainstream 8-bit and 4-bit paths:

  • Performance-critical:

    F.igemmlt

  • Others:

    F.double_quant, F.mm_dequant, F.transform, F.extract_outliers, F.quantize_4bit, F.dequantize_4bit

To extend support for the above functions to Intel CPU/GPU (CUDA remains the same), we propose the following design:
[figure 2]
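
For instance, the functional entry points could dispatch to the registered backend roughly as follows (a hedged sketch reusing the hypothetical get_backend from the device abstraction above):

def igemmlt(A, B, *args, **kwargs):
    # Route to the backend registered for A's device type; CUDA keeps its
    # existing native path, while CPU/XPU use the new implementations.
    return get_backend(A).igemmlt(A, B, *args, **kwargs)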

PR plans:

  • Enable device abstraction for Intel CPU/GPU and CUDA
    Add options to initialize Intel CPU/GPU devices, with no implementations yet; CUDA remains the same.
  • Enable 8-bit functionality for Intel CPU/GPU
    Add implementations of the 8-bit functions for Intel CPU/GPU devices.
  • Enable 4-bit functionality for Intel CPU/GPU
    Add implementations of the 4-bit functions for Intel CPU/GPU devices.

Additional content

In addition, we will propose a PR to upstream Transformers to extend the usage of the bitsandbytes API to multiple devices.

Transformers changes

  • _bitsandbytes_available: no longer limited to CUDA device availability
  • Allow CUDA, CPU, and Intel GPU devices here (see the sketch below)

[figure 3]
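
A hedged sketch of the Transformers-side change (the actual function and module names may differ):

import importlib.util

def is_bitsandbytes_available():
    # Decouple the availability check from torch.cuda.is_available() so that
    # CPU and Intel GPU ("xpu") devices can also use bitsandbytes.
    return importlib.util.find_spec("bitsandbytes") is not None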

jianan-gu (Author) commented

cc @jgong5 @Xia-Weiwen @xiaolil1


jgong5 commented Dec 1, 2023

@yao-matrix @jiqing-feng

jianan-gu (Author) commented

Updated PR #898 for the first plan step above:

Enable device abstraction for Intel CPU/GPU and CUDA
Add options to initialize Intel CPU/GPU devices, with no implementations yet; CUDA remains the same.

Titus-von-Koeller (Collaborator) commented Dec 5, 2023

@jianan-gu Thanks for this high-quality and well-written analysis / design document! Tim and I will look into this, as well as the PR, and come back to you soon.

Unfortunately, both images come back with a 404. I bet this is because they're from a non-public repo, so they would show up correctly for you (with access).

Would you be so kind as to make the images available to us?

jianan-gu (Author) commented Dec 5, 2023

Hi @Titus-von-Koeller, thanks for your reply and the reminder; I have reattached the images. :)

Titus-von-Koeller (Collaborator) commented

@jianan-gu I've talked with Tim about this and we're definitely going forward with this integration.

The design also looks good, but this question warrants a deeper look. Since support for different hardware will likely remain a topic for bitsandbytes going forward, we'd like to reflect on this design decision a moment longer.

Tim is quite busy these days and currently at NeurIPS, so that might delay things a bit.

Xia-Weiwen commented Dec 19, 2023

Submitted PR jianan-gu#3 for step 2 of the plan above (CPU part):

Enable 8-bit functionality for Intel CPU/GPU
Add implementations of the 8-bit functions for Intel CPU/GPU devices.

jgong5 commented Dec 20, 2023

Submitted PR jianan-gu#2 for the above plan step 2 (CPU part).

You are submitting PRs against @jianan-gu's personal branch, not the bitsandbytes mainline?

Xia-Weiwen commented

Submitted PR jianan-gu#2 for the above plan step 2 (CPU part).

You are submitting PRs against @jianan-gu's personal branch, not the bitsandbytes mainline?

Yes. Because his PR has not been merged yet, we cannot use the mainline as the base.

Xia-Weiwen commented Dec 21, 2023

We now have our PR to enable NF4 on CPU/XPU here: jianan-gu#4, for step 3 of our plan:

Enable 4-bit functionality for Intel CPU/GPU
Add implementations of the 4-bit functions for Intel CPU/GPU devices.

Since our PRs depend on each other, we submitted them against our own repos instead of the mainline. We will rebase these PRs onto the mainline when everything is ready.
