[Issue]: is scaled_dot_product_attention part of flash attention? #79
Comments
SDPA implementations are managed by PyTorch itself, and PyTorch will not automatically use a Flash Attention implementation from an external library for its computation. You can monkey patch like this to override SDPA if needed:
howiejay/navi_support only implements the forward pass, so you can't use it for training. If you are interested, you can try the branch in this PR, which updates AOTriton and supports Navi31:
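For readers who want a concrete picture of the monkey patch idea mentioned in this comment, here is a minimal sketch. It is not the snippet from the original comment. The helper names `make_sdpa_override` and `install_flash_sdpa`, the `max_head_dim=128` limit, and the fp16/bf16 restriction are assumptions made for illustration; check them against the Flash Attention build you actually have. Only `torch.nn.functional.scaled_dot_product_attention` and `flash_attn.flash_attn_func` are real entry points.

```python
import functools


def make_sdpa_override(original_sdpa, flash_attn_func,
                       supported_dtypes=None, max_head_dim=128):
    """Return an SDPA replacement that routes eligible calls to flash_attn_func.

    max_head_dim and supported_dtypes are assumed limits (hypothetical defaults),
    not values confirmed by the thread.
    """

    @functools.wraps(original_sdpa)
    def sdpa(query, key, value, attn_mask=None, dropout_p=0.0,
             is_causal=False, scale=None, **kwargs):
        eligible = (
            attn_mask is None                      # flash_attn_func takes no arbitrary mask
            and not kwargs                         # unknown options go to PyTorch
            and len(query.shape) == 4              # (batch, heads, seq, head_dim)
            and query.shape[-1] <= max_head_dim
            and (supported_dtypes is None or query.dtype in supported_dtypes)
            # causal masks are aligned differently when q and k lengths differ
            and (not is_causal or query.shape[-2] == key.shape[-2])
        )
        if not eligible:
            extra = dict(kwargs)
            if scale is not None:
                extra["scale"] = scale
            return original_sdpa(query, key, value, attn_mask=attn_mask,
                                 dropout_p=dropout_p, is_causal=is_causal, **extra)
        # SDPA uses (batch, heads, seq, dim); flash_attn_func expects (batch, seq, heads, dim).
        out = flash_attn_func(
            query.transpose(1, 2), key.transpose(1, 2), value.transpose(1, 2),
            dropout_p=dropout_p, softmax_scale=scale, causal=is_causal,
        )
        return out.transpose(1, 2)

    return sdpa


def install_flash_sdpa():
    """Patch torch.nn.functional in place; requires torch and flash_attn."""
    import torch
    from flash_attn import flash_attn_func

    F = torch.nn.functional
    F.scaled_dot_product_attention = make_sdpa_override(
        F.scaled_dot_product_attention, flash_attn_func,
        supported_dtypes=(torch.float16, torch.bfloat16),
    )
```

Call `install_flash_sdpa()` once, before the application imports its model code. Modules that already did `from torch.nn.functional import scaled_dot_product_attention` keep a reference to the unpatched function, so the order of imports matters.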
is this going to compile on wsl2 linux? i have been banging my head against my desk trying to get this to compile for the past month. there have been multiple issues, but AOTriton was definitely one of them. It would be really nice if the AMD repo had nightly whls for python 3.10 as well as python3.11 and python3.12 including:
Almost everything can be compiled in ROCm in WSL. For PyTorch, you can follow the steps here:
For TorchVision and TorchAudio, they can be compiled easily once the corresponding PyTorch is installed.
For Flash Attention, the branch in #76 is still at an early stage. You can make it work with some fixes and some understanding of the code, but I haven't been able to get extraordinary performance from it.
For xFormers, it should only work on CDNA GPUs at the moment, and I don't know of any repo that works for Navi3x.
For BitsAndBytes, the official repo has a branch, https://github.com/bitsandbytes-foundation/bitsandbytes/tree/multi-backend-refactor. You can build it from there and it will work just fine (only 4bit?).
I have been training some LLMs using https://github.com/hiyouga/LLaMA-Factory on my RX 7900 XTX in recent weeks, with vanilla PyTorch and custom BitsAndBytes in WSL. While it works for single GPU training, the training performance is about 50% of that of an RTX 4090 D.
i'm having trouble setting AOTriton to just build the
Add a new line of
alrighty, i was trying to
seemingly incompatible with WSL2's lack of
@evshiron are you able to identify what i should look into for this issue?
Try uninstalling
See also:
that worked, and i recompiled pytorch with some of the settings that you linked. pytorch compiled, which is nice, but it's not working with anything.
i can't compile torchvision or torchaudio, and like you mentioned, xformers is also not compiling.
What's the exception in that log? If the compilation of TorchVision or TorchAudio doesn't work, you can put their logs here. I can't give advice without detailed context.
very close with torchvision. i do not know how to fix this last error.
i also had to change the name of
@evshiron i really appreciate the help, it's been so difficult to navigate how to do this.
Steps to build PyTorch, TorchVision and TorchAudio in WSL with ROCm integration:
# in a new/clean shell session without PATH interference.
# build torch
git clone https://github.com/pytorch/pytorch
cd pytorch
git fetch origin pull/134498/head:pull/134498
git checkout pull/134498
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
pip3 install numpy==1.26.4 wheel
export PYTORCH_ROCM_ARCH=gfx1100
export AMDGPU_TARGETS=gfx1100
export HCC_AMDGPU_TARGET=gfx1100
export MAX_JOBS=8
echo 2.5.0 > version.txt
# add -DTARGET_GPUS=Navi31 option
vi cmake/External/aotriton.cmake
python3 tools/amd_build/build_amd.py
python3 setup.py bdist_wheel
# install torch
pip3 install dist/torch*.whl
# build torchvision
git clone https://github.com/pytorch/vision
cd vision
echo 0.20.0 > version.txt
python3 setup.py bdist_wheel
# install torchvision
pip3 install dist/torchvision*.whl
cd ..
# build torchaudio
git clone https://github.com/pytorch/audio
cd audio
echo 2.5.0 > version.txt
python3 setup.py bdist_wheel
# install torchaudio
pip3 install dist/torchaudio*.whl
You can now install those
EDIT: this environment variable is needed to enable AOTriton for Navi31 before running your application:
|
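After installing wheels built with steps like the ones above, it can help to confirm which build Python actually picks up, since a wrong virtual environment is an easy mistake. The sketch below is an illustration and not part of the original comment. `describe_torch_build` is a hypothetical helper name; the attributes it reads (`torch.version.hip`, `torch.cuda.is_available`, `torch.backends.cuda.flash_sdp_enabled` and its siblings) are existing PyTorch APIs. Note that the `*_sdp_enabled` flags only say whether a backend is allowed, not whether it is actually selected for a given call.

```python
def describe_torch_build(torch):
    """Collect a few facts about an imported torch module (hypothetical helper)."""
    info = {
        "torch_version": torch.__version__,
        # torch.version.hip is set on ROCm builds and None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm builds expose the GPU through the torch.cuda namespace.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    sdp = torch.backends.cuda
    info["flash_sdp_enabled"] = sdp.flash_sdp_enabled()
    info["mem_efficient_sdp_enabled"] = sdp.mem_efficient_sdp_enabled()
    info["math_sdp_enabled"] = sdp.math_sdp_enabled()
    return info


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
    else:
        for key, value in describe_torch_build(torch).items():
            print(f"{key}: {value}")
```

Run it inside the same virtual environment that the application uses. If `hip_version` is `None` or the device name is missing, the application is probably not using the ROCm wheel that was just built.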
@evshiron this was the original issue i had when trying to compile vision. the previous error i showed you was for the rocm repo of vision. torch audio did work
nvJPEG should not be used unless you have the CUDA toolkit installed, and you can
https://github.com/ROCm/vision is really old and you should not use it.
worked! you are a god. now i just have to try xformers.
testing the env:
I guess i need to rebuild?
xFormers will not work on Navi3x. Only
If not, you have to install
There are various Flash Attention implementations in the ROCm ecosystem, but few of them have superior performance. This branch has a battle-tested CK-based implementation that works with Navi3x, but only the forward pass is implemented, which means you can't use it for training:
Here is how to use it for better performance:
All other Triton-based implementations (including AOTriton) aren't going to perform better than the CK one on Navi3x, but some of them implement the backward pass. They do save a lot of VRAM, but may not even perform better than the Math implementation (the fallback one in PyTorch) in some cases.
There is even a rocWMMA-based implementation by an unofficial developer. I haven't tried it, but if you are interested, follow the thread:
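To give a rough sense of why fused attention kernels save VRAM compared with the Math fallback, the arithmetic below estimates the size of the attention score matrix that a non-fused implementation has to materialize. This is a back-of-the-envelope sketch, not a measurement. The function names and the example shape are made up for illustration, and real usage is higher because other intermediates also exist.

```python
def attention_scores_bytes(batch, heads, q_len, kv_len, bytes_per_element=2):
    """Bytes needed for one full (q_len x kv_len) score matrix per head.

    bytes_per_element=2 corresponds to fp16 or bf16.
    """
    return batch * heads * q_len * kv_len * bytes_per_element


def to_mib(num_bytes):
    """Convert bytes to MiB."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical shape: batch 1, 16 heads, 4096 tokens, fp16.
    size = attention_scores_bytes(batch=1, heads=16, q_len=4096, kv_len=4096)
    print(f"{to_mib(size):.0f} MiB for the score matrix alone")  # 512 MiB
```

The score matrix grows with the square of the sequence length, so doubling the token count quadruples this term. Flash-style kernels compute attention in tiles and avoid storing the full matrix, which is where the VRAM saving comes from even when they are not faster.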
just to be clear, i have had every one of these packages that we have discussed already installed and multiple issues regarding most of them posted on github. i dont know how
I'll have to look into this more and I appreciate you taking your time.
anything to save ram would be great. I get OOMs very often, and i don't even max out the CPU. Something isn't right with the training, but now that you mention it, i was probably nerfing myself with flash attention. I probably need to delete my venv and start fresh. I really wish they just gave us whls. I waste so much time trying to get this card to work and it's still not really capable.
looks good right now. i was in the wrong env when i tried to build it...
Import times for custom nodes:
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/websocket_image_save.py
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Image-Selector
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-SD3-Powerlab
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-SD3LatentSelectRes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-ollama
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-selector
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-HQ-Image-Save
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_3dPoseEditor
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/SD3-Scaling
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/sd-dynamic-thresholding
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_Noise
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Thumbnails
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_Cutoff
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_ADV_CLIP_emb
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-AutoTrimBG
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/stability-ComfyUI-nodes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-SAI_API
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-SD3-nodes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_JPS-Nodes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-ollama-prompt-encode
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_IPAdapter_plus
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Flowty-TripoSR
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-layerdiffuse
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Custom-Scripts
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Video-Matting
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-LuminaWrapper
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_UltimateSDUpscale
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/steerable-motion
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_essentials
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-browser
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/RES4LYF
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-DynamiCrafterWrapper
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-GGUF
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-portrait-master
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Flowty-CRM
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-sound-lab
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-dream-video-batches
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-KJNodes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui_bmad_nodes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-segment-anything-2
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/comfy-image-saver
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-0246
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-AnimateAnyone-Evolved
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-CCSR
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/rgthree-comfy
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-TiledDiffusion
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Keyframed
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-SUPIR
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_Comfyroll_CustomNodes
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-PhotoMaker-ZHO
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-AnimateDiff-Evolved
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Inspire-Pack
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/facerestore_cf
0.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-AudioReactor
0.1 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Impact-Pack
0.1 seconds: /home/musclez/ComfyUI/custom_nodes/SeargeSDXL
0.1 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-reactor-node
0.1 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-VideoHelperSuite
0.2 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-Manager
0.2 seconds: /home/musclez/ComfyUI/custom_nodes/StableZero123-comfyui
0.2 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-PixelArt-Detector
0.2 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_Jags_Audiotools
0.3 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_InstantID
0.3 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_StableAudio_Open
0.7 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_Jags_VectorMagic
0.8 seconds: /home/musclez/ComfyUI/custom_nodes/anynode
0.8 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-DeepFuze
1.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_Primere_Nodes
1.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-TTools
1.4 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUi-Ollama-YN
1.7 seconds: /home/musclez/ComfyUI/custom_nodes/was-node-suite-comfyui
2.1 seconds: /home/musclez/ComfyUI/custom_nodes/comfyui-mixlab-nodes
2.5 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-APISR-KJ
3.4 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-AudioReactive
3.4 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-StableAudioSampler
4.6 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI_OpenVoice
7.0 seconds: /home/musclez/ComfyUI/custom_nodes/ComfyUI-MotionDiff
do you know if it is possible to upgrade to ROCm 6.2 on WSL2? we need https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/cooperative_groups.html for https://github.com/graphdeco-inria/diff-gaussian-rasterization
I guess it's not going to work.
Problem Description
I often get these errors from various applications; this one is from ComfyUI.
Is scaled_dot_product_attention part of flash attention? I am using howiejay/navi_support which enables 7900XT gfx1100 flash attention support on ROCm devices.
Operating System
WSL2 Ubuntu 22.04 Windows 11
CPU
7800x3D
GPU
AMD Radeon RX 7900 XT
ROCm Version
ROCm 6.1.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response