ROCM Support #47
+1. It would be great to get this working on AMD hardware if there's no hard technical limitation blocking compatibility. Newer Radeon cards often have a lot of VRAM and work well for ML; the main remaining issue is library compatibility.
I personally do not have time to implement ROCm support. If you have experience with ROCm I could guide you through the steps to get a working solution.
An unofficial working port exists here: https://github.com/broncotc/bitsandbytes-rocm. Would it be possible to merge the changes into the official repo?
Edit: It seems I just attempted to install using the wrong flags while building. It works great.
EDIT: A slightly newer version based on v0.37 is available here:
After much searching, I did find something that works - so I'll share it here.
I have found this makes bitsandbytes work with some things on my GPU...
My AMD kohya setup worked with Lion, but I want to try AdamW8bit like the others. For the "make hip" build step, I performed some steps like,
There are also other steps that I did as some errors appeared during the "make hip" setup battle, but I am not sure whether they mattered or were correct:
(All these steps only made the bnb-rocm "make hip" errors and the kohya bnb import/usage errors go away,
The Adam8bit optimizer issue probably wasn't from ROCm itself, but the general issue that's been fixed in the latest version, 0.41.1. I made my own fork that just uses defines to make the CUDA code work with HIP, so it's easier to keep up to date and supports the latest version. Since the patch was just in Python, though, you could also apply it to the fork you are currently using.
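For anyone who, like the kohya comment above, wants to swap Lion for the 8-bit optimizer once the library builds: below is a minimal sketch of plain bitsandbytes usage (not the kohya config; the model and shapes are made up for illustration).

```python
import torch
import bitsandbytes as bnb

# Toy model purely for illustration; any nn.Module's parameters work the same way.
model = torch.nn.Linear(1024, 1024).cuda()  # ROCm builds of PyTorch expose the GPU via the cuda API

# Drop-in replacement for torch.optim.AdamW that keeps optimizer state in 8 bit.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, weight_decay=1e-2)

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

If the step() call runs on the ROCm build, the 8-bit optimizer path is working.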
Hi @arlo-phoenix, nice fork. What would I comment out of the Makefile to test a 6900 XT (gfx1030)? I have already set `export HSA_OVERRIDE_GFX_VERSION=10.3.0`.
You shouldn't have to change anything special afaik, just doing

```
git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6.git bitsandbytes
cd bitsandbytes
#see makefile comments under hip for more info
#using pip install . since python setup.py install is deprecated
make hip
pip install .
```

works on a newly created docker using ROCm 5.6 (the one I'm using). This has all the environment variables including path setup, and since ROCm doesn't have that much documentation I really recommend it. If you don't wanna use docker, you'll need to make sure that hipcc is at /usr/bin or change the makefile accordingly (I'll probably update it myself at some point to work better out of the box, I just copied it from a previous port). You can also set the env var ROCM_HOME to the path of your ROCm install (normally /opt/rocm) if the automatic find function doesn't work.

For testing you can then go into the tests folder and use pytest test_optim.py to see what works and what doesn't. I'd be careful with the other tests, some froze my PC. I'd say this is probably because I just excluded some parts of the code: they didn't compile because of the different warp size (it's double on AMD and some static_asserts failed because of that) or the missing hipBLASLt stuff, or, which I hope it's not, actual ROCm library issues. I'll see if I can fix that; I want to try out QLoRA at some point and I'm pretty sure I need at least some of it, but I haven't tried that out yet.

TLDR: use a ROCm 5.6 docker image, then clone the fork, make hip, and pip install .
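As a quick post-install sanity check (a hedged sketch, assuming a ROCm build of PyTorch is already in the environment; on ROCm the GPU is still reported through torch's cuda API):

```python
import torch
import bitsandbytes as bnb

print(torch.cuda.is_available())        # True if the ROCm device is visible
print(torch.cuda.get_device_name(0))

# Tiny end-to-end check: one forward pass through an 8-bit linear layer.
layer = bnb.nn.Linear8bitLt(64, 64, has_fp16_weights=False).cuda()
x = torch.randn(4, 64, dtype=torch.float16, device="cuda")
print(layer(x).shape)
```

Fuller coverage still comes from pytest test_optim.py as described above.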
Thank you very much @arlo-phoenix, one more step forward. Now "make hip" is struggling with the below paths for my ROCm 5.6:
Can you check that the $ROCM_HOME include references in the Makefile of your fork match your installation paths? Or is this just my Ubuntu installation and the amdgpu-install ROCm PATHs being non-standard in some way?
I am able to build arlo-phoenix's fork of bitsandbytes using the docker image rocm/pytorch. The nightly version of the image gives errors in make hip though.
All headers should be found under $ROCM_HOME/include. To control where it searches for the libraries and headers you could try changing these two:

```makefile
HIP_INCLUDE := -I $(ROCM_HOME)/include -I $(ROOT_DIR)/csrc -I $(ROOT_DIR)/include
HIP_LIB := -L $(ROCM_HOME)/lib -lhipblas -lhiprand -lhipsparse
```

e.g. adding -I /opt/rocm/hip/include to the include line, and doing that for literally every library; you might need to change some headers as well, so this is more of a last resort.
Ok that's weird... I haven't pulled it in a while, but I just tried it out again and it worked for me. But it's great that rocm/pytorch still works, hope it's useful! I didn't really look too deep into the images and thought just the pytorch version would be different, but apparently it's not, good to know. |
I've been using bitsandbytes as a requirement for petals. The latest port to ROCm (git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6.git) allows a simple petals install in a docker container. Petals defaults to a quant type of nf4; using this gets a bus error. Things work if --quant-type int8 is passed to petals. It would be really nice if nf4 could be made to work in the arlo-phoenix version (btw, thanks for the port!). In the interest of repeatability:
(in the running image)
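For context, this is the code path in question: not how petals invokes it internally, just a hedged transformers-level sketch of nf4 quantization through bitsandbytes (the model name is only an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 is the 4-bit quant type petals defaults to; --quant-type int8 corresponds to 8-bit loading instead.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",   # example model only, not what a petals swarm serves
    quantization_config=nf4_config,
    device_map="auto",
)
```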
Should work now. At least the tests from bitsandbytes for nf4 succeeded. See the README in the fork for updated install instructions (I improved the Makefile to take the ROCM_TARGET as an argument). The reason I didn't add this from the beginning and just did an ifndef around it, so it wasn't used with ROCm, is that I didn't know enough about warp sizes on different devices. So that you can use this, your GPU needs to support wave32. It's supported since RDNA (https://en.wikipedia.org/wiki/RDNA_(microarchitecture)), so I think it should work for most people. It will even compile if your GPU doesn't support it, since I forcefully redefine __AMDGCN_WAVEFRONT_SIZE (it throws a lot of warnings and should not be done, but hey, it works, and I couldn't find any alternative). For whatever reason this takes the wrong value for gfx10xx GPUs and the priority for the issue ROCm/MIOpen#1431 isn't high anymore, so expect the workaround to stay for a while. To check if your GPU supports it, call rocminfo and look at the reported wavefront size. The static_asserts in question:

```
/opt/rocm-5.6.0/include/rocprim/block/block_load.hpp:776:5: error: static assertion failed due to requirement 'BlockSize % ::rocprim::device_warp_size() == 0': BlockSize must be a multiple of hardware warpsize
/opt/rocm-5.6.0/include/rocprim/block/block_store.hpp:505:5: error: static assertion failed due to requirement 'BlockSize % ::rocprim::device_warp_size() == 0': BlockSize must be a multiple of hardware warpsize
    static_assert(BlockSize % ::rocprim::device_warp_size() == 0,
```

I have no bloody idea if it's possible to make a workaround. If all modern AMD GPUs support wave32 it shouldn't matter, but I couldn't find shit about the CDNA lineup, so if it doesn't support wave32 (which I highly doubt) that's a them problem .-.. You can also get warpSize from torch, so it would be possible to catch BLOCK_SIZE 64 when the wave size is 64; I didn't test yet whether that also returns the wrong size, probably does.
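Along the lines of the torch suggestion at the end, a small sketch (hedged: the warp_size field on device properties exists in newer ROCm builds of PyTorch; older builds may not expose it):

```python
import torch

props = torch.cuda.get_device_properties(0)
# 32 on wave32-capable RDNA parts; 64 on wave64 hardware. Fall back if the field is missing.
warp_size = getattr(props, "warp_size", None)
print("device:", props.name, "warp size:", warp_size)

if warp_size == 64:
    print("wave64 device: the forced 32-wide wavefront in the fork may not be valid here")
```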
arlo-phoenix, your update of 8 Aug works great with petals. There was a bug in petals measuring performance :-/ so ignore the following numbers: inference went from 22.3 to 333.3 tokens/sec (15x faster) and forward pass throughput went from 4783 to about 1433376 tokens/sec (300x faster)!
Bitsandbytes was not supported on Windows before, but my method can support Windows. (yuhuang)

```
3 J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes-windows
4 J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl
```

Replace your SD venv directory (the python.exe folder) here (J:\StableDiffusion\sdwebui\py310).
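A minimal check that the wheel actually loaded (hedged sketch; run it with the same python.exe as above, the path being whatever your venv uses):

```python
# e.g. J:\StableDiffusion\sdwebui\py310\python.exe check_bnb.py (path from the example above)
import torch
import bitsandbytes as bnb

print("bitsandbytes", bnb.__version__)
print("CUDA available:", torch.cuda.is_available())
```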
Is there anything we can do to get ROCm support into the main branch? AMD is only going to gain market share moving forward, and ROCm is already supported by most major frameworks/libraries. We can go around bnb for inference, but we need it for fine-tuning.
I need support for gfx908 (AMD Instinct MI100).
This should be on main in the next 2 months. We're actively working on this, among other high-impact things. Thanks for your patience. You can already pip install the alpha release. Please reference the installation instructions in our official docs and give us feedback about your experience to help us deliver the best experience once this merges to main. Thanks!
bitsandbytes seems to be hardcoded to search for specific CUDA libraries, which don't seem to be provided the same way by ROCm.