
Undocumented CUDA graphs requirement that kernels must use stream #114048

Open
tsengalb99 opened this issue Nov 19, 2023 · 5 comments
Labels
module: cuda graphs (ability to capture and then replay streams of CUDA kernels)
module: custom-operators (custom operators, custom ops)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


tsengalb99 commented Nov 19, 2023

🐛 Describe the bug

I have a function that calls some custom CUDA kernels interleaved with PyTorch operations. When I capture the function with a CUDA graph, the custom kernels become no-ops: the graph captures without error and replays all operations except those inside the custom kernels. Does torch's CUDA graph support work with custom kernels? Is there something special I need to do to enable custom kernels with graphs? A sketch of the kind of code involved is below.
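A minimal sketch of the kind of extension wrapper that can produce this symptom (hypothetical kernel and function names, not the actual code from this report). The launch omits the stream argument, so the kernel goes to the default stream rather than the stream PyTorch is capturing on, and it never enters the graph:

```cpp
// buggy_ext.cu -- hypothetical PyTorch C++/CUDA extension
#include <torch/extension.h>

__global__ void scale_kernel(float* x, float s, int64_t n) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
}

void scale_inplace(torch::Tensor x, float s) {
  int64_t n = x.numel();
  const int threads = 256;
  const int blocks = (int)((n + threads - 1) / threads);
  // Launched with no stream argument -> default stream. Under
  // torch.cuda.graph capture this kernel is not recorded in the graph.
  scale_kernel<<<blocks, threads>>>(x.data_ptr<float>(), s, n);
}
```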

Versions

torch 2.1.0 on CUDA 11.8

cc @mcarilli @ezyang

@jbschlosser added the triaged, module: custom-operators, and module: cuda graphs labels on Nov 20, 2023

ezyang commented Nov 20, 2023

It does work with custom kernels, but there are a number of ways you could have messed up the kernels so that bad things happen. You'll probably have to share some code for more info.

tsengalb99 (Author) commented

I managed to get my kernel to work by passing the current CUDA stream to the kernel call in the C++ wrapper. I only figured this out by finding a kernel that did work and comparing my code against it. Is this requirement documented anywhere in PyTorch? Not passing in the stream works fine when the C++ wrapper is called outside of a graph.
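For reference, a sketch of the fix described above, assuming a PyTorch C++/CUDA extension (names are hypothetical): fetch the current stream with at::cuda::getCurrentCUDAStream() and pass it as the fourth launch parameter.

```cpp
// fixed_ext.cu -- same hypothetical extension, launching on the current stream
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream

__global__ void scale_kernel(float* x, float s, int64_t n) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
}

void scale_inplace(torch::Tensor x, float s) {
  int64_t n = x.numel();
  const int threads = 256;
  const int blocks = (int)((n + threads - 1) / threads);
  // PyTorch's current stream is the capturing stream inside
  // torch.cuda.graph, so the kernel is recorded into the graph.
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  scale_kernel<<<blocks, threads, 0, stream>>>(x.data_ptr<float>(), s, n);
}
```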


ezyang commented Nov 20, 2023

It's a requirement of CUDA graphs itself. We could remind users about it in the CUDA graph API docs. Wanna send a doc patch?

@ezyang ezyang changed the title torch cuda graph with custom cuda kernel possible bug Undocumented CUDA graphs requirement that kernels must use stream Nov 20, 2023

ngimel commented Dec 4, 2023

The CUDA graph API will warn if no kernels were captured at all (e.g., if capture started on stream S but all the kernels ran on the default stream). However, a mixture of kernels on the default stream and the capturing stream (without stream synchronizations in between) is an error we can't possibly catch: maybe the user did want those kernels to run eagerly and not be captured.
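Since that mixed case can't be caught automatically, one user-side option (a sketch, not a PyTorch API) is for a wrapper that still launches on the default stream to check cudaStreamIsCapturing and fail loudly instead of silently dropping work:

```cpp
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

// Hypothetical guard: error out if the current stream is mid-capture,
// since a default-stream launch at this point would escape the graph.
void assert_not_capturing() {
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  cudaError_t err =
      cudaStreamIsCapturing(at::cuda::getCurrentCUDAStream(), &status);
  TORCH_CHECK(err == cudaSuccess,
              "cudaStreamIsCapturing failed: ", cudaGetErrorString(err));
  TORCH_CHECK(status == cudaStreamCaptureStatusNone,
              "current stream is capturing a CUDA graph; a default-stream "
              "kernel launch here would not be recorded");
}
```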

Abhishekghosh1998 commented

@tsengalb99, can you please share your code if possible? I'm just curious to know about your approach and where/how it went wrong.
