AMP will crash with non-tensorcore GPUs #528
Hi there,

I updated apex today (pulled from GitHub) and now I am getting an error when running mixed-precision training on GPUs that don't have tensor cores. The following snippet comes from running a 2D U-Net on a TitanXp GPU:

Any idea what's going on?

Best,
Fabian

Comments
I built apex with:
I have now done several experiments. First of all, I changed my installation command to the one from the readme:
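For reference, the full build command in the README is something along these lines (quoted from memory, so the exact flags may differ from the current README):

```bash
# Full apex build with both the C++ and CUDA extensions,
# as recommended in the apex README:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```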
Second, the error above appears on different GPUs depending on where I built apex. I am running apex on our GPU cluster, which has different types of GPUs: RTX2080Ti, TitanXp, and V100. All software is shared between the nodes in the cluster, so if I compile and install apex on one node, all nodes will have that version. Here is what I found:
Has it always been like this? I cannot remember having any problems in the past. The GCC version is 7.2.0, the CUDA version is 10.0, and PyTorch is the most recent nightly.

Best,
Fabian

Edit: The whole problem does not appear if I do a Python-only installation. That of course produces a warning, and I get a small performance penalty.
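For completeness, the Python-only installation referred to here is the README's fallback, roughly (again, the flags are illustrative):

```bash
# Python-only build: omits --cpp_ext/--cuda_ext, so no compiled
# CUDA kernels are produced and no per-architecture build is needed
pip install -v --no-cache-dir ./
```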
This is an artifact of some recent changes to how PyTorch builds extensions: by default, the extension is compiled only for the compute capability of the GPU that is visible during the build, so a binary built on one node will not necessarily run on GPUs of a different generation. In your case, if you want a single build that works for Titan Xp (compute capability 6.1), V100 (cc 7.0), and RTX2080Ti (cc 7.5), you can set TORCH_CUDA_ARCH_LIST to cover all three architectures before building.
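A minimal sketch of that, assuming the README build command from above (TORCH_CUDA_ARCH_LIST is the environment variable consulted by PyTorch's extension builder, torch.utils.cpp_extension):

```bash
# Compile for all three compute capabilities in one build;
# the list is semicolon-separated:
export TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5"
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

The resulting binaries contain device code for every listed architecture, so the same shared installation should then run on any of the cluster's GPU types.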
Outstanding, thank you!
Thank you.
@mcarilli It might be worth advertising this fact/fix in the README.