September 2023, tested on 7900 XTX

Following the great instructions from August below and using the same Docker image, this runs on the 7900 XTX with a few changes, most notably:

export HSA_OVERRIDE_GFX_VERSION=11.0.0 #7900 xtx natively works with the gfx1100 driver
make hip ROCM_TARGET=gfx1100

The rest of the steps are the same as in the August guide below.
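
To double-check which gfx target ROCm reports for your card before building (rocminfo ships with ROCm; the grep is just a convenience, not part of the original steps):

rocminfo | grep gfx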

August 2023, tested on 6900 XT and 6600 XT

Due to the great work of Odonata (Discord, GitHub @edt-xx), the hardware of oceanmasterza (Discord), and the help of epicx (Discord, GitHub @bennmann), we have the AMD instructions below.

According to the author of the bitsandbytes ROCm port, @arlo-phoenix, using a Docker image is recommended (both rocm/pytorch and rocm/pytorch-nightly should work). See the port discussion for details.

On the host machine, run:

docker pull rocm/pytorch-nightly
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/pytorch-nightly
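
If you want downloaded model weights to survive container restarts, you can also bind-mount the Hugging Face cache into the container. The -v flag below is a standard Docker bind mount and the cache path is the Hugging Face default; neither is part of the original instructions:

sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/.cache/huggingface:/root/.cache/huggingface rocm/pytorch-nightly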

In the running image, run:

cd /home
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Install bitsandbytes with ROCm support
git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6.git bitsandbytes
cd bitsandbytes
make hip ROCM_TARGET=gfx1030
pip install pip --upgrade
pip install .
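
# Optional sanity check (a suggestion, not part of the original steps):
# confirm the patched bitsandbytes built and imports cleanly
python -c "import bitsandbytes as bnb; print(bnb.__version__)"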

# Install Petals
cd ..
pip install --upgrade git+https://github.com/bigscience-workshop/petals

# Run server
python -m petals.cli.run_server petals-team/StableBeluga2 --port <an open port> --torch_dtype float16

Running the model in bfloat16 is also supported but slower than in float16.
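
For example, to serve in bfloat16 instead:

python -m petals.cli.run_server petals-team/StableBeluga2 --port <an open port> --torch_dtype bfloat16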

Multi-GPU serving (--tensor_parallel_devices) is still untested (Docker's --gpus flag may not function for AMD GPUs at this time, and other virtualization tools may be necessary).

July 2023, tested on 6900 XT and 6600 XT

Contributed by: @edt-xx, @bennmann

Tested on:

  • AMD 6600 XT, tested July 24th, 2023 on Arch Linux with ROCm 5.6.0, mesa 22.1.4
  • AMD 6900 XT, tested April 18th, 2023 on bare-metal Ubuntu 22.04 (no Docker/Anaconda/container), with ROCm 5.4.2
  • Untested on the 7000 series; however, 7000-series cards may perform much better, since AMD added a machine learning tensor library and broader hardware support (vs. ray tracing only on the 6000 series)

Guide:

  • use the mesa-clover and mesa-rusticl OpenCL variants

  • add export HSA_OVERRIDE_GFX_VERSION=10.3.0 to your environment (put it in ~/.bashrc on Ubuntu; this tricks ROCm into working on more consumer-grade cards like the 6000 series)
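
    On Ubuntu with bash, one way to make this persistent (a suggestion, not part of the original steps):

    echo 'export HSA_OVERRIDE_GFX_VERSION=10.3.0' >> ~/.bashrc
    source ~/.bashrc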

  • install ROCm. For Arch Linux, use this tutorial: https://wiki.archlinux.org/title/GPGPU
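
    To confirm the install sees your GPU, rocm-smi (shipped with ROCm) prints a per-card status table:

    rocm-smi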

  • create and activate a venv for Petals using Python 3.11

    • python -m venv <yourvenvpath>
    • cd <yourvenvpath>
    • source bin/activate
  • in the venv, install the PyTorch nightly build with the command generated by the selector at https://pytorch.org/get-started/locally/
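
    At the time of writing, the generated command looked roughly like this (the ROCm version in the index URL changes over time, so prefer whatever the site outputs):

    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6

    ROCm builds of PyTorch expose the GPU through the torch.cuda API, so you can verify the install with:

    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"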

  • install the Petals version with AMD GPU support:

    pip install git+https://github.com/bigscience-workshop/petals@amd-gpus

    This branch uses an older version of bitsandbytes patched to have AMD GPU support (developed by @brontoc and Titaniumtown). This means that you won't be able to use 4-bit quantization (--quant_type nf4) or LoRA adapters (the --adapters argument). The server will use 8-bit quantization (int8) for all models by default.

    Tip: You can set your fans to full speed or close to it before starting Petals (the default Linux fan profile for AMD GPUs is not good on some cards): rocm-smi --setfan 99%
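
    To sanity-check the install before starting a server (an optional check, not in the original guide):

    python -c "import petals; print(petals.__version__)"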

  • run Petals using:

    python -m petals.cli.run_server petals-team/StableBeluga2

    Tip: You can monitor temperature and voltage by running: rocm-smi && rocm-smi -t