
Support 4bit on CPU backend #1206

Conversation


@Xia-Weiwen Xia-Weiwen commented May 10, 2024

Adds implementation for the following ops on CPU backend:

  • quantize_4bit
  • dequantize_4bit
  • gemv_4bit

Limitations:

  • quant_storage must be torch.uint8
  • compress_statistics is not supported yet (bnb_4bit_use_double_quant must be False)
  • FP4 is currently slow because there is no fused kernel for it yet.

Difference from CUDA implementation:

  • On the CPU backend, A does not have to be a vector to use the fused dequant-gemm kernel, whereas CUDA requires it. The op is therefore named gemv_4bit, but on the CPU backend it actually performs a GEMM.
  • Different numerical accuracy due to different kernel implementations

Here is an example code snippet that runs Hugging Face models with 4bit on the CPU backend: https://gist.github.com/Xia-Weiwen/592d6e24e03f904a18692b3e27794c53. You will have to bypass the CUDA checks in transformers to run it.
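
Separately, the new ops can be called directly through bitsandbytes.functional. Here is a minimal sketch (assuming the functional API keeps the signatures it exposes on the CUDA backend; the shapes and defaults below are illustrative and not taken from this PR):

import torch
import bitsandbytes.functional as F

A = torch.randn(1024, 1024, dtype=torch.bfloat16)  # plain CPU tensor

# NF4 takes the fused-kernel path on CPU; FP4 works but is currently slower.
q, state = F.quantize_4bit(
    A,
    blocksize=64,
    quant_type="nf4",
    compress_statistics=False,   # double quant is not supported on CPU yet
    quant_storage=torch.uint8,   # only uint8 quant_storage is supported on CPU
)
A_dq = F.dequantize_4bit(q, state, quant_type="nf4")
print(q.shape, A_dq.dtype, (A.float() - A_dq.float()).abs().max())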


cc @jiqing-feng @jgong5 @jianan-gu

@Xia-Weiwen Xia-Weiwen changed the title [WIP] Support NF4 on CPU backend [WIP] Support 4bit on CPU backend May 11, 2024
Comment on lines +450 to +452
out_dq = torch.empty(out_uint8.shape).to(quant_state.dtype)
for i in range(len(quant_state.code)):
    out_dq[out_uint8 == i] = quant_state.code[i]
Contributor

Using index select will be faster: out_dq = quant_state.code[out_uint8.to(torch.int32)].

Author

It looks like the torch.compile result of this code gives wrong results, and removing torch.compile lowers performance. Let's keep this implementation for now.
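
For context, here is a quick hypothetical micro-benchmark of the two lookup strategies discussed in this thread (the code table, shapes, and function names below are illustrative stand-ins, not taken from the PR):

import time
import torch

code = torch.linspace(-1.0, 1.0, 16)                      # stand-in for quant_state.code
out_uint8 = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)

def loop_lookup():
    # current implementation: one masked assignment per code value
    out = torch.empty(out_uint8.shape, dtype=code.dtype)
    for i in range(len(code)):
        out[out_uint8 == i] = code[i]
    return out

def index_lookup():
    # suggested implementation: a single index select
    return code[out_uint8.to(torch.int32)]

for fn in (loop_lookup, index_lookup):
    fn()                                                   # warm-up
    t0 = time.time()
    fn()
    print(fn.__name__, round((time.time() - t0) * 1000, 2), "ms")

print(torch.equal(loop_lookup(), index_lookup()))

Actual numbers depend on the machine and on whether torch.compile is applied, so this only illustrates the trade-off being discussed.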


A bug in torch.compile? Can you submit a bug to PyTorch? I will try to fix it.

Author

However, I cannot reproduce the issue with the script below. It may need more investigation.

import torch


NF4_DEQUANT_TABLE = torch.Tensor([
  -1.0,
  -0.6961928009986877,
  -0.5250730514526367,
  -0.39491748809814453,
  -0.28444138169288635,
  -0.18477343022823334,
  -0.09105003625154495,
  0.0,
  0.07958029955625534,
  0.16093020141124725,
  0.24611230194568634,
  0.33791524171829224,
  0.44070982933044434,
  0.5626170039176941,
  0.7229568362236023,
  1.0,
])


@torch.compile
def dequant_nf4_compile(t_in: torch.Tensor, out_dtype):
  return NF4_DEQUANT_TABLE[t_in.to(torch.int)].to(out_dtype)


def dequant_nf4_eager(t_in: torch.Tensor, out_dtype):
  return NF4_DEQUANT_TABLE[t_in.to(torch.int)].to(out_dtype)


x = torch.randint(0, 16, (1024, 1024), dtype=torch.uint8)

y1 = dequant_nf4_compile(x, torch.bfloat16)  # first call triggers compilation (warm-up)
y1 = dequant_nf4_compile(x, torch.bfloat16)  # second call runs the compiled kernel
y2 = dequant_nf4_eager(x, torch.bfloat16)

print(torch.equal(y1, y2))
print("max diff =", torch.abs(y1 - y2).max())

@Xia-Weiwen Xia-Weiwen changed the title [WIP] Support 4bit on CPU backend Support 4bit on CPU backend May 21, 2024
@Xia-Weiwen Xia-Weiwen requested a review from jiqing-feng May 21, 2024 03:01
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review May 21, 2024 03:02
@jiqing-feng
Contributor

jiqing-feng commented May 23, 2024

Hi @Titus-von-Koeller. Here are the test results of this PR on an Intel 4th Gen Xeon CPU:
[image: benchmark results]

The big difference between NF4 and FP4 is that we can use fused ops for NF4, but they are not yet available for FP4. FP4 will also support fused ops and should reach the same performance as NF4, probably in the next IPEX release. Would you please review it? Thanks!

Test script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time

MAX_NEW_TOKENS = 64
model_id = "meta-llama/Llama-2-7b-chat-hf"

text = 'I am happy because'
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer(text, return_tensors="pt").input_ids

print('Loading model {}...'.format(model_id))
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_quant_type="fp4",
                                         bnb_4bit_use_double_quant=False,
                                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, torch_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

print('model dtype = {}'.format(model.dtype))

with torch.no_grad():
    # warmup
    model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    print("warm-up complite")
    t0 = time.time()
    generated_ids = model.generate(input_ids, max_length=MAX_NEW_TOKENS, do_sample=False, num_beams=1)
    latency = time.time() - t0
    print(input_ids.shape)
    print(generated_ids.shape)
    result = "| latency: " + str(round(latency * 1000, 3)) + " ms |"
    print('+' + '-' * (len(result) - 2) + '+')
    print(result)
    print('+' + '-' * (len(result) - 2) + '+')

output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"output: {output}")

@Titus-von-Koeller
Collaborator

Dear @Xia-Weiwen et al,

Unfortunately we're (mostly me alone) quite resource-constrained and humbled by the workload associated with the multi-backend-refactor. I just talked with my colleague @younesbelkada about how to best handle the next steps.

We both took a look at this PR and the one from AMD and think that, at first glance, everything looks really good. At this time, Younes and I are not in a position to give detailed feedback, and I need to focus on concretizing the path forward for integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

However, for the purpose of iterative progress (as agreed in our prior conversations), we've decided to go ahead and merge both the open Intel and AMD branches into multi-backend-refactor, where interested parties can then compile from source and give the new functionality (we're so excited and grateful about this!) thorough testing.

Once we've made some progress on the torch.library-based refactor, I'll next focus on enabling nightly releases for that branch as well. We're also looking forward to your feedback on this torch.library / tensor-driven dispatch topic once the code is there as a basis for discussion (and for refactoring the backend-specific code towards that new target, once we've agreed with all of you that this is the right path).

Among other things, there has also been extensive ongoing work in the background on moving BNB to a new independent/non-profit GitHub org, under the umbrella of Hugging Face and with the support of their infra team for managing the complexities of the CI/CD backend and runners. We're also working to make GitHub runners for the different hardware platforms a reality (thanks for your help on that!).

Thanks again for the good work and active collaboration! ❤️ 🚀

@Titus-von-Koeller Titus-von-Koeller merged commit 701c5aa into bitsandbytes-foundation:multi-backend-refactor May 24, 2024
1 of 2 checks passed
@Titus-von-Koeller
Collaborator

Titus-von-Koeller commented May 24, 2024

P.S. Also see this: README: asking for help from volunteer alpha testers

Let us know if you have further thoughts on this and how you think it's best to communicate about this.

@Xia-Weiwen Xia-Weiwen requested a review from jgong5 May 27, 2024 00:46
@Xia-Weiwen
Author

Xia-Weiwen commented May 27, 2024

Hi @Titus-von-Koeller, thanks a lot for your help on this. We are glad to provide feedback on the adoption of torch.library. Please let us know when there is any update.
Also, we would love to volunteer to conduct regular tests on Intel CPU/GPU as alpha testers. I think we will need to be aligned on many aspects of the tests, such as the test code base, methods, frequency, scope, and how we sync and publish the results. Maybe we can create an issue to track this. I will discuss with my colleagues and come back later.

@jgong5

jgong5 commented May 27, 2024

> At this time, Younes and I are not in a position to give detailed feedback, and I need to focus on concretizing the path forward for integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

Hi @Titus-von-Koeller, may I learn more details about how you are going to refactor things via torch.library? I believe this is one of the official ways of integrating native backend implementations with PyTorch: it provides Python bindings and a backend dispatching mechanism. I shared similar comments earlier: #898 (comment).

Meanwhile, it would also be beneficial to allow backend integration without adding native code to bitsandbytes explicitly, e.g., optimizing via torch.compile as this PR does, or integrating via third-party Python extensions like "ipex" (Intel Extension for PyTorch). This is a more lightweight approach than adding native code.
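
For readers unfamiliar with the mechanism, here is a minimal sketch of registering a per-backend kernel through the torch.library Python API (illustrative only: the "bnb_demo" namespace and the op below are made up for this example and are not part of bitsandbytes or of the planned refactor):

import torch

# Define a tiny custom-op namespace and declare the op's schema.
lib = torch.library.Library("bnb_demo", "DEF")
lib.define("dequant_lookup(Tensor codes, Tensor code_table) -> Tensor")

def dequant_lookup_cpu(codes, code_table):
    # CPU kernel: index select into the 16-entry lookup table
    return code_table[codes.to(torch.int32)]

# Register the CPU kernel; other backends would register their own kernels
# for the same op, e.g. lib.impl("dequant_lookup", cuda_fn, "CUDA").
lib.impl("dequant_lookup", dequant_lookup_cpu, "CPU")

x = torch.randint(0, 16, (8,), dtype=torch.uint8)
table = torch.linspace(-1.0, 1.0, 16)
y = torch.ops.bnb_demo.dequant_lookup(x, table)  # dispatched by tensor device
print(y)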
