[RFC] Extend bitsandbytes to support Intel hardware platforms #894

Open · 1 of 3 tasks

jianan-gu opened this issue Dec 1, 2023 · 10 comments

Motivation

The current bitsandbytes library is bound to CUDA platforms. However, there is rapidly growing demand to run large language models (LLMs) on more platforms, such as Intel® CPU and GPU devices ("xpu" is the PyTorch device tag for Intel GPUs). We therefore aim to extend Intel® CPU and GPU ecosystem support and optimizations to bitsandbytes, offering the same scope of low-precision computation features (8-bit and 4-bit) as on CUDA.

Approach

To provide the 8-bit and 4-bit features on Intel platforms, we propose two major changes:

  1. A device abstraction that allows non-CUDA devices to be added to bitsandbytes easily. It contains a device backend abstraction defining the key kernel interfaces each backend must implement, a backend registration interface for adding new device backends, and a kernel dispatching mechanism (a minimal sketch follows this list).
  2. Lightweight enabling of Intel CPU and GPU support on top of the device abstraction. We plan to leverage the PyTorch 2.x compiler stack and the custom kernels provided by Intel® Extension for PyTorch (IPEX) to support Intel CPU and GPU without the need to upstream native backend code. This reduces the complexity of adding new devices to bitsandbytes and also reduces maintenance costs.
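
As a rough illustration of change (1), the device abstraction could look like the following minimal sketch (the names Backend, register_backend, and get_backend are hypothetical, not a final API):

class Backend:
    # Abstract device backend: each device implements the key kernels.
    def igemmlt(self, A, B, *args, **kwargs):
        raise NotImplementedError
    def double_quant(self, A, *args, **kwargs):
        raise NotImplementedError

_backends = {}

def register_backend(device_type, backend):
    # e.g. register_backend("cpu", CPUBackend()); register_backend("xpu", XPUBackend())
    _backends[device_type] = backend

def get_backend(tensor):
    # Dispatch on the tensor's device type ("cuda", "cpu", "xpu", ...)
    return _backends[tensor.device.type]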

Device abstraction

We will extend the CUDA-only device setup and initialization in bitsandbytes to Intel CPU/GPU, and provide common device abstractions for general devices (there will be no changes on the CUDA side).
Note that there is also no API or usage change for Hugging Face users when running bitsandbytes on different devices:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

Lightweight integration for Intel CPU/GPU

We will do a lightweight and simple integration to enable the low-precision computation features, both 8-bit and 4-bit. We don't plan to add native backend code for Intel CPU and GPU in the first step. Instead, we will employ PyTorch 2.x compilation and Intel® Extension for PyTorch to enable those features.

  • For performance-critical functions, such as GEMM, we will import IPEX as a Python module and use its API for computation. IPEX can provide the best performance for such functions across Intel devices (for example, by using the Intel® Advanced Matrix Extensions instruction set on 4th generation Intel® Xeon® Scalable Processors).
    • IPEX, as an optional acceleration component, has already been integrated into mainstream Hugging Face tooling, such as the Trainer class and the accelerate library, to speed up training and inference. As Hugging Face does, we will integrate IPEX with bitsandbytes as a Python library dependency.

For example:

import torch
import intel_extension_for_pytorch  # registers the torch.ops.torch_ipex operators

def igemmlt_cpu_xpu(A_i8, B_i8, *args, **kwargs):
    # ... preparation before computation ...
    C_i32 = torch.ops.torch_ipex.matmul_i8i8i32(A_i8, B_i8)  # int8 GEMM computation via IPEX
    # ... other post-processing ...
    return C_i32
  • For other functions, we adopt the PyTorch 2.x compilation technology. We will implement them with basic PyTorch operators in Python and optimize them using torch.compile to get good performance. Intel is one of the major contributors to the torch.compile CPU backend in PyTorch and also hosts the torch.compile GPU backend in IPEX. The implementation can also work on other devices that support the PyTorch 2.x compiler stack.

For example:

import torch

@torch.compile
def double_quant_cpu_xpu(*args, **kwargs):
    # Implement double_quant for Intel CPU/GPU with plain PyTorch ops;
    # torch.compile will generate kernel code and compile it at runtime.
    ...

Design

(1) Reorganize device_setup to support multiple devices

Intel CPU or GPU

  1. is_ipex_available
  2. Import IPEX ops (and also check Intel GPU device availability); a sketch of these checks follows below

CUDA

  1. Remains the same: load from lib_cuda.so
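
A minimal sketch of what these availability checks might look like (the helper names are hypothetical):

import importlib.util
import torch

def is_ipex_available():
    # Check whether Intel Extension for PyTorch is installed.
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None

def is_xpu_available():
    # Check whether an Intel GPU ("xpu") device is usable.
    return is_ipex_available() and hasattr(torch, "xpu") and torch.xpu.is_available()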

[figure 1]

(2) Device backend abstraction with key kernel interfaces

Key functions used in the mainstream 8-bit and 4-bit paths:

  • Performance-critical:

    F.igemmlt

  • Others:

    F.double_quant, F.mm_dequant, F.transform, F.extract_outliers, F.quantize_4bit, F.dequantize_4bit

To extend support for the above functions to Intel CPU/GPU (CUDA remains the same), we propose the following design:
[figure 2]
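
For instance, the functional entry points could dispatch to the registered backend roughly as follows (a hedged sketch reusing the hypothetical get_backend from the device abstraction above):

def igemmlt(A, B, *args, **kwargs):
    # Route to the backend registered for A's device type; CUDA keeps its
    # existing native path, while CPU/XPU use the new implementations.
    return get_backend(A).igemmlt(A, B, *args, **kwargs)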

PR plans:

  • Enable device abstraction for Intel CPU/GPU and CUDA
    Add options to initialize Intel CPU/GPU devices, with no implementations yet; CUDA remains the same.
  • Enable 8-bit functionality for Intel CPU/GPU
    Add implementations of the 8-bit functions for Intel CPU/GPU devices.
  • Enable 4-bit functionality for Intel CPU/GPU
    Add implementations of the 4-bit functions for Intel CPU/GPU devices.

Additional content

In addition, we will propose a PR to upstream Transformers to extend the usage of the bitsandbytes API to multiple devices.

Transformers changes

  • _bitsandbytes_available: no longer limited to CUDA device availability
  • Allow CUDA, CPU, and Intel GPU devices here (see the sketch below)

[figure 3]
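
A hedged sketch of the Transformers-side change (the actual function and module names may differ):

import importlib.util

def is_bitsandbytes_available():
    # Decouple the availability check from torch.cuda.is_available() so that
    # CPU and Intel GPU ("xpu") devices can also use bitsandbytes.
    return importlib.util.find_spec("bitsandbytes") is not None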

jianan-gu (Author) commented

cc @jgong5 @Xia-Weiwen @xiaolil1


jgong5 commented Dec 1, 2023

@yao-matrix @jiqing-feng

jianan-gu (Author) commented

Updated PR #898 for the first plan step above:

Enable device abstraction for Intel CPU/GPU and CUDA
Add options to initialize Intel CPU/GPU devices, with no implementations yet; CUDA remains the same.

Titus-von-Koeller (Collaborator) commented Dec 5, 2023

@jianan-gu Thanks for this high-quality and well-written analysis / design document! Tim and I will look into this, as well as the PR, and come back to you soon.

Unfortunately, both images come back with a 404. I bet this is because they're from a non-public repo, so they would show up correctly for you (with access).

Would you be so kind as to make the images available to us?

jianan-gu (Author) commented Dec 5, 2023

Hi @Titus-von-Koeller, thanks for your reply and the reminder; I have reattached the images. :)

Titus-von-Koeller (Collaborator) commented

@jianan-gu I've talked with Tim about this and we're definitely going forward with this integration.

The design also looks good, but this question warrants a deeper look. Since support for different hardware will likely remain a topic for bitsandbytes going forward, we'd like to reflect on this design decision a moment longer.

Tim is quite busy these days and currently at NeurIPS, so that might delay things a bit.

Xia-Weiwen commented Dec 19, 2023

Submitted PR jianan-gu#3 for step 2 of the plan above (CPU part):

Enable 8-bit functionality for Intel CPU/GPU
Add implementations of the 8-bit functions for Intel CPU/GPU devices.

jgong5 commented Dec 20, 2023

Submitted PR jianan-gu#2 for the above plan step 2 (CPU part).

You are submitting PRs against @jianan-gu's personal branch, not the bitsandbytes mainline?

Xia-Weiwen commented

Submitted PR jianan-gu#2 for the above plan step 2 (CPU part).

You are submitting PRs against @jianan-gu's personal branch, not the bitsandbytes mainline?

Yes. Because his PR has not been merged yet, we cannot use the mainline as the base.

Xia-Weiwen commented Dec 21, 2023

We now have our PR to enable NF4 on CPU/XPU here: jianan-gu#4, for step 3 of our plan:

Enable 4-bit functionality for Intel CPU/GPU
Add implementations of the 4-bit functions for Intel CPU/GPU devices.

Since our PRs depend on each other, we submitted them against our own repos instead of the mainline. We will rebase these PRs onto the mainline when everything is ready.
