[Relax] Integrate cuDNN attention #17157
Conversation
Thank you @vinx13 for the new addition!
Overall, looks good to me.
Would you describe the high-level strategy for attention somewhere? (e.g., when to offload to cuDNN, CUTLASS, TIR, etc.)
If this PR is about landing the machinery rather than making such offloading decisions, I would appreciate it if you could provide some recommendations.
The new attention can be applied via the cudnn BYOC backend. The decision of which BYOC backend (cudnn or cutlass) to use is left to the users. cuDNN is likely to perform better on H100 as it has specific optimizations.
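A minimal sketch of how this offloading would be driven from the Relax side, assuming the `partition_for_cudnn` helper exposed by the cuDNN BYOC backend (the cutlass flow would use `partition_for_cutlass` instead); this is an illustration of the intended usage, not code from the PR:

```python
# Sketch only: assumes tvm.relax.backend.contrib.cudnn.partition_for_cudnn
# is available, as with the other Relax BYOC backends.
import tvm
from tvm import relax
from tvm.relax.backend.contrib.cudnn import partition_for_cudnn


def offload_attention_to_cudnn(mod: tvm.IRModule) -> tvm.IRModule:
    # Group supported patterns (e.g. attention) into composite functions
    # marked for the "cudnn" external codegen.
    mod = partition_for_cudnn(mod)
    # Lower the partitioned functions through the cuDNN BYOC codegen.
    mod = relax.transform.RunCodegen()(mod)
    return mod


# The backend choice stays with the user: swap in the cutlass partitioner
# here to target CUTLASS instead of cuDNN.
# ex = tvm.relax.build(offload_attention_to_cudnn(mod), target="cuda")
```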
I remember cuDNN attention supports fp8; it would be interesting to support that too.
This integrates cuDNN attention kernels into BYOC.
A dependency on cudnn_frontend is added.
The cuDNN attention kernel supports fused qkv in BS3NH and SBN3H layouts.
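For illustration, a hedged reading of the two fused-qkv layout strings (B = batch, S = sequence length, N = number of heads, H = head size, and 3 the packed q/k/v axis), shown in plain NumPy rather than the TVM API:

```python
# Sketch only: demonstrates the tensor shapes the layout strings denote,
# under the assumption that each letter maps to one axis in order.
import numpy as np

b, s, n, h = 2, 128, 12, 64

# BS3NH: q/k/v packed per token, shape (B, S, 3, N, H).
qkv_bs3nh = np.random.randn(b, s, 3, n, h).astype("float16")
q, k, v = qkv_bs3nh[:, :, 0], qkv_bs3nh[:, :, 1], qkv_bs3nh[:, :, 2]

# SBN3H: sequence-major, packed axis after the head axis, shape (S, B, N, 3, H).
qkv_sbn3h = np.random.randn(s, b, n, 3, h).astype("float16")
q2, k2, v2 = qkv_sbn3h[..., 0, :], qkv_sbn3h[..., 1, :], qkv_sbn3h[..., 2, :]
```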
cc @sunggg @masahi @yongwww @tqchen