Feature Request: ROCm support (AMD GPU) #107

Open · gururise opened this issue Dec 11, 2022 · 42 comments
Labels: enhancement (New feature or request) · high priority (first issues that will be worked on) · Low Risk (Risk of bugs in transformers and other libraries)


@gururise

Could you please add official AMD ROCm support to this library? An unofficial working port already exists:

https://github.com/broncotc/bitsandbytes-rocm

Thank You

@TimDettmers (Collaborator)

Amazing! Thank you for bringing this to my attention. I will try to get in touch with the author of the ROCm library and support AMD GPUs by default.

TimDettmers added the enhancement (New feature or request) label Feb 2, 2023
@YellowRoseCx

> Amazing! Thank you for bringing this to my attention. I will try to get in touch with the author of the ROCm library and support AMD GPUs by default.

That would be AMAZING! Especially with you recently adding 8-bit support. I tried to make my own merge of the forks, but I don't really know what I'm doing and don't think I did it correctly.

@anonymous721

anonymous721 commented Feb 14, 2023

If the ROCm fork does get merged in, would the Int8 Matmul compatibility improvements also work for AMD GPUs?

@deftdawg

@TimDettmers, curious if AMD support is any nearer to being merged? @agrocylo made a PR (#296) based somewhat on @broncotc's fork...

@gururise (Author)

EDIT: A slightly newer version branched from v0.37 is available here:
https://github.com/Titaniumtown/bitsandbytes-rocm/tree/patch-2

@elukey

elukey commented Jun 22, 2023

The Wikimedia Foundation is really interested in ROCm support too, since Nvidia is not viable for us due to open-source constraints. @TimDettmers, we offer any help (testing/review/etc.) to get this feature merged; it would be really great for the open-source ML ecosystem. Thanks in advance!

@Aria-K-Alethia

Aria-K-Alethia commented Jul 17, 2023

> EDIT: A slightly newer version branched from v0.37 is available here: https://github.com/Titaniumtown/bitsandbytes-rocm/tree/patch-2

Hi,
I'm also seeking an AMD-GPU-compatible version.
I tried your patch-2 version, but the code still doesn't work.
The error looks like:

  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
    from ._functions import undo_layout, get_inverse_transform_indices
  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 9, in <module>
    import bitsandbytes.functional as F
  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/functional.py", line 17, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 74, in <module>
    raise RuntimeError('''
RuntimeError: 
        CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment!
        If you cannot find any issues and suspect a bug, please open an issue with detals about your environment:
        https://github.com/TimDettmers/bitsandbytes/issues

I'm using an AMD MI200 card.
Do you have any idea about this?
Many thanks.

@PatchouliPatch

Hello, I was wondering how far off ROCm support is. I'm trying to see if my 7900 XTX will be useful in a project of mine. The Llama 2 quick start guide makes use of bitsandbytes, and as far as I know there aren't any alternatives.

@jiagaoxiang

Found this ROCm version of bitsandbytes: https://github.com/Lzy17/bitsandbytes-rocm/tree/main

@mauricioscotton

The only ROCm version that worked for me on GFX900 was this one: https://github.com/agrocylo/bitsandbytes-rocm
All the others failed to compile/install.
(ROCm 5.2)

@st1vms

st1vms commented Nov 23, 2023

For anyone who needs a patch for RDNA3 cards, I created this fork: https://github.com/st1vms/bitsandbytes-rocm-gfx1100

This fork patches the Makefile to target the gfx1100 amdgpu module with the latest ROCm and clang 17, and fixes some HIP include warnings.

Works with an RX 7900 XT and ROCm 5.7 (along with torch for ROCm 5.7) installed.

Anyway, there should be a better way of targeting the correct amdgpu module in the build system...

Edit:

Probably won't work with libraries requiring version > 0.35
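
On the point about targeting the correct amdgpu module: a minimal sketch for checking which gfx target the installed GPU reports, assuming a ROCm build of PyTorch (the gcnArchName property may not be exposed on older builds):

import torch

# Print the gfx architecture of the first visible GPU, e.g. "gfx1100" for an
# RX 7900 XT/XTX, so the Makefile's amdgpu target can be set to match.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(getattr(props, "gcnArchName", "gcnArchName not exposed by this build"))
else:
    print("No GPU visible to PyTorch")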

@Wintoplay

@st1vms There is a problem.
The version of BnB is 0.35.4, which is rather outdated, and the latest version of PEFT requires bitsandbytes>=0.37.0.

@st1vms

st1vms commented Dec 11, 2023

> @st1vms There is a problem.
> The version of BnB is 0.35.4, which is rather outdated, and the latest version of PEFT requires bitsandbytes>=0.37.0.

If that fork still works for you, maybe it is OK to just change the version number.

You can test whether the library works with:

python -m bitsandbytes

If it does, try editing the version number in the fork's setup.py before building and installing it, i.e. change it to 0.37.0 and see if PEFT works...
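
A minimal smoke test after rebuilding might look like this (a sketch; these forks differ, so treat the exact functional API as an assumption):

import importlib.metadata

import torch
import bitsandbytes as bnb

# The version PEFT checks comes from the package metadata set in setup.py.
print(importlib.metadata.version("bitsandbytes"))

# Quick functional check: blockwise quantize/dequantize round trip on the GPU.
x = torch.randn(64, 64, device="cuda", dtype=torch.float16)
q, state = bnb.functional.quantize_blockwise(x)
y = bnb.functional.dequantize_blockwise(q, quant_state=state)
print((x - y).abs().max())  # should be a small quantization error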

@Wintoplay

Wintoplay commented Dec 11, 2023

@st1vms I tried BNB 0.39.0.
The dependencies seem fine. However, when I tried to LoRA finetune according to this notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing

The Jupyter kernel crashed, reason: undefined.

[Screenshot: 2023-12-12 00-11-47]

@st1vms

st1vms commented Dec 11, 2023

> @st1vms I tried BNB 0.39.0.
> The dependencies seem fine. However, when I tried to LoRA finetune according to this notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing
>
> The Jupyter kernel crashed, reason: undefined.

Well, the fork is probably already obsolete for some libraries; you should look for updated ones.

@Wintoplay

@st1vms
I retried with a new virtual env and changed from .ipynb to .py.
This is the result:

(torch3) win@win-MS-7E02:/mnt/1df6b45e-20dc-41ca-9a04-b271fd3a4940/Learn$ /usr/bin/env /home/win/torch3/bin/python /home/win/.vscode-oss/extensions/ms-python.python-2023.20.0-universal/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 60843 -- /mnt/1df6b45e-20dc-41ca-9a04-b271fd3a4940/Learn/finetune.py
/home/win/torch3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/win/torch3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/win/torch3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00, 7.21s/it]
trainable params: 8388608 || all params: 6666862592 || trainable%: 0.12582542214183376
0%| | 0/200 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
/home/win/torch3/lib/python3.10/site-packages/torch/utils/checkpoint.py:461: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/win/torch3/lib/python3.10/site-packages/bitsandbytes-0.41.0-py3.10.egg/bitsandbytes/autograd/_functions.py:231: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")

=============================================
ERROR: Your GPU does not support Int8 Matmul!

python: /mnt/1df6b45e-20dc-41ca-9a04-b271fd3a4940/bitsandbytes-rocm-gfx1100/csrc/ops.cu:347: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t *, const int8_t *, void *, float *, int, int, int) [FORMATB = 3, DTYPE_OUT = 32, SCALE_ROWS = 0]: Assertion `false' failed.

@gururise (Author)

> @st1vms I tried BNB 0.39.0.
> The dependencies seem fine. However, when I tried to LoRA finetune according to this notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing
> The Jupyter kernel crashed, reason: undefined.
>
> Well, the fork is probably already obsolete for some libraries; you should look for updated ones.

Can someone post any updated forks to this thread? The lack of proper BnB support is really holding back AMD cards.

@gururise (Author)

Looks like things may finally move forward with official support in the not-too-distant future! Hopefully with ROCm 6.x we can finally see support merged into this repo.

@TimDettmers (Collaborator)

Sorry for taking so long on this. I am currently onboarding more maintainers and we should see some progress on this very soon. This is one of our high-priority issues.

TimDettmers added the high priority (first issues that will be worked on) and Low Risk (Risk of bugs in transformers and other libraries) labels Jan 1, 2024
@SakshamG7

Would love to see ROCm support. Keep up the good work!

@PatchouliPatch

If I may ask, what's the progress so far?

@Airradda

> If I may ask, what's the progress so far?

If you haven't already seen it, there was a comment in the discussions with an accompanying tracking issue for general cross-platform support rather than just AMD/ROCm support. To that end, it appears to currently be in the planning phase.

@amathews-amd (Contributor)

@TimDettmers @Titus-von-Koeller, we are at ~95% parity for bnb on https://github.com/ROCm/bitsandbytes/tree/rocm_enabled for Instinct-class GPUs, and are working to close the gaps on Navi. At this point, we should seriously consider upstreaming. Could you drop me an email at [email protected], and we can set up a call to discuss further?
cc: @sunway513 @Lzy17 @pnunna93

@chauhang

@amathews-amd I tried compiling the ROCm version of BnB from the rocm_enabled branch, but it fails with errors on an AMD MI250x. Do you have any suggestions for how to resolve the issue?

@pnunna93 (Contributor)

@chauhang Could you try with ROCm 6.0? You can use this Docker image - rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2 - and install bitsandbytes directly.

@chauhang

@pnunna93 I am already using ROCm 6.0 -- I've added details of the PyTorch environment here.

@pnunna93 (Contributor)

@chauhang, you can skip the hipblaslt update and install bitsandbytes directly then. Please let me know if you face any issues.

@ehartford

I was using the arlo-phoenix fork: https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6/tree/rocm

Should I use the ROCm fork instead? https://github.com/ROCm/bitsandbytes/tree/rocm_enabled

@pnunna93 (Contributor)

Yes, it's updated for ROCm 6.

@matthewdouglas (Member)

> @TimDettmers @Titus-von-Koeller, we are at ~95% parity for bnb on https://github.com/ROCm/bitsandbytes/tree/rocm_enabled for Instinct-class GPUs, and are working to close the gaps on Navi. At this point, we should seriously consider upstreaming. Could you drop me an email at [email protected], and we can set up a call to discuss further? cc: @sunway513 @Lzy17 @pnunna93

I've often had trouble understanding the state of GPU support in ROCm. So with that said, I have some clarification questions:

  • Can we clarify what we mean by "Instinct-class" GPUs?
    • The ROCm 6.0.2 docs suggest to me this is all CDNA, so MI100 and newer? Or is the MI50 expected to work also?
  • What is the intention for Navi support?
    • Is this for RDNA2/RDNA3 only?
  • Is there intent to support ROCm < 6?

I'd like to be able to help get this merged, but need to figure out the constraints. The only AMD GPUs that I have on hand (RX 570 and R9 270X) aren't going to cut it.

The other issue is how far behind main this is. Ideally this could be implemented as a separate backend as proposed in #898. We would want to change to use CMake for building. I also think that it'd be better to unify the C++/CUDA code with the hipify code and take care of most of the changes with conditional compilation.

@amathews-amd (Contributor)

Sure, here is the official list: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus. For BnB, since we are at initial enablement, support depends on where we are testing (both hardware and software versions). We are currently focusing our testing on MI250/MI300/gfx1100 and newer ROCm versions.
We are assessing #898 as well, to see how we can adapt the rocm_enabled branch so it fits the new design.

@Titus-von-Koeller (Collaborator)

> @TimDettmers @Titus-von-Koeller, we are at ~95% parity for bnb on https://github.com/ROCm/bitsandbytes/tree/rocm_enabled for Instinct-class GPUs, and are working to close the gaps on Navi. At this point, we should seriously consider upstreaming. Could you drop me an email at [email protected], and we can set up a call to discuss further? cc: @sunway513 @Lzy17 @pnunna93

@amathews-amd I sent an invite to our bnb-crossplatform Slack to the email you provided. Of course we should invite your other collaborators as well. Can we talk there and coordinate scheduling a kickoff call?

@Titus-von-Koeller (Collaborator)

@amathews-amd The changes introduced through #898 are not final and weren't merged onto main but onto multi-backend-refactor instead, in order to keep main releasable and allow us to iteratively arrive at a solution that works for all parties involved.

This means that there's ongoing work where a series of PRs onto multi-backend-refactor will concretize things further in tight collaboration with the community. Feel free to pitch in with opinions and concrete work, in case there's something that catches your eye and fits your expertise.

@PatchouliPatch

Is there a place where we can track progress on the implementation of this?

@PatchouliPatch

By the way, does anyone know where I can submit bug reports for https://github.com/ROCm/bitsandbytes/tree/rocm_enabled? Going to the page, there's no Issues tab.

@Titus-von-Koeller (Collaborator)

Titus-von-Koeller commented May 13, 2024

> By the way, does anyone know where I can submit bug reports for https://github.com/ROCm/bitsandbytes/tree/rocm_enabled? Going to the page, there's no Issues tab.

Maybe @pnunna93 or @amathews-amd from AMD can help with that? I'm sure they'd appreciate your report.

> Is there a place where we can track progress on the implementation of this?

Right now the best place is to look at open and recently merged PRs on the multi-backend-refactor branch.

We should make significant progress in the coming weeks and make an alpha/beta release built off that branch available as a nightly package relatively soon.

(@PatchouliPatch)

@pnunna93 (Contributor)

> By the way, does anyone know where I can submit bug reports for https://github.com/ROCm/bitsandbytes/tree/rocm_enabled? Going to the page, there's no Issues tab.

We created an Issues tab - https://github.com/ROCm/bitsandbytes/issues - please feel free to open any bug reports.

@katanazero86

katanazero86 commented Jul 7, 2024

I hope that BnB working in the ROCm environment will be officially released as soon as possible.

For a few days, I worked through a fine-tuning example in a Radeon RX 7900 XTX + ROCm 6.1.2 environment, but the issue of BnB not being recognized really gave me a headache.

I felt like this was why everyone was buying Nvidia graphics cards.
I also went through this guide: https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html

I followed that section, but the BnB it includes was not recognized properly.

$ amd-smi version
AMDSMI Tool: 24.5.1+c5106a9 | AMDSMI Library version: 24.5.2.0 | ROCm version: 6.1.2
  • torch: 2.3.1+rocm6.0
  • torchvision: 0.18.1+rocm6.0

Success code:

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from peft import (
    LoraConfig,
    get_peft_model
)
from trl import SFTTrainer, SFTConfig

from huggingface_hub import login

login("yourToken")

# Base model and tokenizer names.
base_model_name = "google/gemma-2b-it"

# Load base model to GPU memory.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code=True).to(device)

# Load tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    base_model_name,
    trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Dataset for fine-tuning.
training_dataset_name = "daekeun-ml/naver-news-summarization-ko"
training_dataset = load_dataset(training_dataset_name, split="train")

# Check the data.
print(training_dataset)

# Dataset 11 is a QA sample in English.
print(training_dataset[11])
print(training_dataset[0])

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)

sft_config = SFTConfig(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    # optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard",
    dataset_text_field="summary",
)

# View the number of trainable parameters.

peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()

# Initialize an SFT trainer.

sft_trainer = SFTTrainer(
    model=base_model,
    train_dataset=training_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=sft_config,
)

# Run the trainer.
sft_trainer.train()

Failing code:

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)
from peft import (
    LoraConfig,
    get_peft_model
)
from trl import (
    SFTTrainer,
    SFTConfig
)

from huggingface_hub import login

login("yourToken")

from datasets import load_dataset

dataset = load_dataset("daekeun-ml/naver-news-summarization-ko")
print(dataset['train'][0])

# Load the model
BASE_MODEL = "google/gemma-2b-it"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, add_special_tokens=True)


def generate_prompt(example):
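    # The Korean instruction in the prompt template below means
    # "Please summarize the following text:".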
    prompt_list = []
    for i in range(len(example['document'])):
        prompt_list.append(r"""<bos><start_of_turn>user
다음 글을 요약해주세요:

{}<end_of_turn>
<start_of_turn>model
{}<end_of_turn><eos>""".format(example['document'][i], example['summary'][i]))
    return prompt_list


train_data = dataset['train']
print(generate_prompt(train_data[:1])[0])

lora_config = LoraConfig(
    r=6,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()

sft_config = SFTConfig(
    output_dir="./outputs",
    max_steps=3000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    # optim="paged_adamw_8bit",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    logging_steps=100,
    push_to_hub=False,
    report_to="tensorboard",
)

sft_trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_data,
    peft_config=lora_config,
    tokenizer=tokenizer,
    args=sft_config,
    formatting_func=generate_prompt
)

sft_trainer.train()

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 1.46 GiB. GPU 
  0%|          | 0/3000 [00:02<?, ?it/s]
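
For reference, this memory pressure is exactly what bitsandbytes is meant to relieve: once a working ROCm build is recognized, the base model can be loaded quantized. A hedged sketch using the transformers BitsAndBytesConfig API with the model from the code above (requires accelerate for device_map; not verified on this ROCm setup):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model 4-bit quantized to cut GPU memory roughly 4x vs fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)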

@pnunna93 (Contributor)

pnunna93 commented Jul 8, 2024

Hi @katanazero86, sorry for the trouble you have faced. Your torch version appears to be built for ROCm 6.0; please install a ROCm 6.1 torch build and rebuild bitsandbytes.

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
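
To confirm the installed wheel actually matches, a quick check (minimal sketch, assuming a ROCm build of PyTorch):

import torch

print(torch.__version__)          # should report a +rocm6.1 build, not +rocm6.0
print(torch.version.hip)          # HIP version the wheel was built against; None on CUDA builds
print(torch.cuda.is_available())  # ROCm devices surface through the CUDA API in PyTorch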

@katanazero86

Hi @pnunna93,
Thank you for your reply. When I installed the ROCm 6.1 build, the error message changed.

RuntimeError: CUDA error: HIPBLAS_STATUS_NOT_SUPPORTED when calling `HIPBLAS_STATUS_NOT_SUPPORTED`
  0%|          | 0/3000 [00:07<?, ?it/s]

I guess the versions are not compatible. It's so annoying. T.T

@Lzy17

Lzy17 commented Jul 15, 2024

Hi @katanazero86,

Sorry for the trouble. I have tested your code within this Docker image (docker pull rocm/rocm-terminal:6.1.2), and both runs executed without error. Could you try using this Docker image if possible?

Additionally, please install PyTorch as suggested by @pnunna93 within the docker container:

pip install networkx==3.1
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/

Thanks!

@katanazero86

@Lzy17

Thank you for the answer. When I have time later, I will try again using the Docker image :)
