[RFC]: OpenAI Triton-only backend #5083
Comments
I think this is a good idea. Question: what would you do about the other (non-attention) custom ops? We also do not have these parameterized the way we do for attention, and we will likely need to update this.
@bringlein Thanks for putting together this RFC and sharing your experiment results. Agreed that this is a good idea: having a backend that is easier to port across various GPUs, and that requires far fewer lines of code (as you mentioned, 3,500 vs. 500 LoC) while not hurting performance, is the way to go.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Motivation.
Recently, the OpenAI Triton backend for AMD hardware (PR 3643) was merged, which is so far the only flash attention backend whose source code is part of vLLM. Among the advantages of OpenAI Triton are its superior platform and performance portability. We (@tdoublep and myself) therefore wanted to investigate whether this code could work equally well on a different platform, i.e. NVIDIA GPUs.
Our experiments show that running the code contributed by AMD on different NVIDIA hardware (A100, L40, H100) results in competitive prefill performance compared to the default option (flash_attn). For a smaller number of heads, which may be the case when using tensor parallelism, it is even faster. For these experiments, we used the code contributed by AMD but replaced the autotuner options with options better suited to the different GPUs; we did not change the actual Triton code.
Therefore, could we consider a Triton-only backend? While this does not (yet) result in a performance advantage in all cases, there are a number of additional technical motivations beyond raw performance.
Proposed Change.
We propose adding a new backend that runs the Triton flash attention code on both NVIDIA and AMD platforms. We would start with the existing flash attention kernel, but we also want to discuss the option of covering other kernels.
We would also contribute our additional options for the Triton autotuner, so that the results described above can be reproduced.
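For discussion purposes, here is a minimal sketch of how a user might opt into such a backend, assuming vLLM's existing VLLM_ATTENTION_BACKEND environment variable were extended with a new value. The value "TRITON_FLASH" and the exact selection mechanism are purely placeholders, not part of vLLM today:

```python
import os

# Hypothetical backend identifier; the actual name would be decided in review.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_FLASH"

from vllm import LLM, SamplingParams

# Any supported model; the backend choice is orthogonal to the model.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The appeal of this selection path is that nothing else in the serving setup needs to change; on both NVIDIA and AMD hardware the same Triton kernels would be dispatched, with only the autotuner configs differing per platform.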
Feedback Period.
No response
CC List.
@hongxiayang @WoosukKwon @Yard1 @jpvillam-amd
Any Other Things.
No response