
AMP will crash with non-tensorcore GPUs #528

Closed
FabianIsensee opened this issue Oct 7, 2019 · 6 comments


@FabianIsensee

Hi there,
I updated apex today (pulled from GitHub) and now I am getting an error when running mixed-precision training on GPUs that don't have Tensor Cores. The following snippet comes from running a 2D U-Net on a TitanXp GPU:

RuntimeError: CUDA error: no kernel image is available for execution on the device (multi_tensor_apply at csrc/multi_tensor_apply.cuh:104)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2b148356f543 in /home/isensee/dl_venv_new/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0xba6 (0x2b149e4db0f6 in /home/isensee/dl_venv_new/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0xa90 (0x2b149e4d8c50 in /home/isensee/dl_venv_new/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: + 0x229b7 (0x2b149e4ca9b7 in /home/isensee/dl_venv_new/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: + 0x1d5af (0x2b149e4c55af in /home/isensee/dl_venv_new/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/amp_C.cpython-37m-x86_64-linux-gnu.so)

frame #49: __libc_start_main + 0xf5 (0x2b1416258c05 in /lib64/libc.so.6)
frame #50: python3() [0x400721]

Any idea what's going on?
Best,
Fabian

@FabianIsensee
Author

I built apex with

python setup.py install --cuda_ext --cpp_ext

@FabianIsensee
Author

FabianIsensee commented Oct 7, 2019

I have now done several experiments. First of all, I changed my installation command to the one from the readme:

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Second, the error above appears on different GPUs depending on where I built apex. I am running apex on our GPU cluster, which has different types of GPUs: RTX2080ti, TitanXp and V100. All software is shared between the nodes in the cluster, so if I compile and install apex on one node, all nodes will have that version.

Here is what I found:

  1. If I build and install apex on an RTX2080ti node, it will work on RTX2080ti cards but not on TitanXp or V100.
  2. If I build and install apex on a V100 node, it will work on RTX2080ti and V100, but not on TitanXp.
  3. If I build and install apex on a TitanXp node, it will work on TitanXp, but not on RTX2080ti or V100.
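The pattern in these three findings matches CUDA's SASS binary-compatibility rule: a kernel compiled for one compute capability only runs on devices with the same major architecture and an equal or higher minor version. A minimal sketch of that rule (the GPU-to-capability mapping and helper function are illustrative, not part of apex):

```python
# Rough model of CUDA SASS binary compatibility: a cubin built for
# compute capability (major, minor) runs only on devices with the same
# major architecture and an equal-or-higher minor version.
GPU_CC = {"TitanXp": (6, 1), "V100": (7, 0), "RTX2080ti": (7, 5)}

def cubin_runs_on(build_cc, device_cc):
    return build_cc[0] == device_cc[0] and device_cc[1] >= build_cc[1]

# Built on a V100 node (cc 7.0): works on V100 and RTX2080ti, not TitanXp,
# which is exactly finding 2 above.
for gpu, cc in GPU_CC.items():
    print(gpu, cubin_runs_on(GPU_CC["V100"], cc))
```

This ignores PTX forward compatibility (a build embedding PTX can be JIT-compiled for newer architectures), but it explains why each build only ran on the node family it was compiled on.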

Has it always been like this? I cannot remember having any problems in the past.

The GCC version is 7.2.0, the CUDA version is 10.0, and PyTorch is the most recent nightly.
I would very much appreciate your help!

Best,
Fabian

Edit: The whole problem does not appear if I do a Python-only installation. That of course gives a warning:

Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")

And I get a small performance penalty.
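The warning above comes from apex probing for the compiled extension module at import time, roughly with a try/except pattern like this (the flag name here is illustrative):

```python
try:
    import amp_C  # fused CUDA kernels, only present after a --cuda_ext build
    fused_kernels_available = True
except ImportError:
    # Without the extension, apex falls back to slower pure-Python
    # tensor-by-tensor operations, hence the performance penalty.
    fused_kernels_available = False

print(fused_kernels_available)
```

Since the Python fallback never loads the architecture-specific cubins, it sidesteps the "no kernel image" crash entirely.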

@mcarilli
Contributor

mcarilli commented Oct 7, 2019

This is an artifact of some recent changes to how PyTorch builds extensions:
pytorch/pytorch#23408
If the environment variable TORCH_CUDA_ARCH_LIST is not set, PyTorch will build extensions only for the architecture of the node where you are compiling (e.g. if you are compiling on a node with a V100, it will compile for Volta, which will work on Volta and probably on Turing as well). Apex is set up to respect this logic, unless you are building on a system with no GPUs, in which case Apex sets TORCH_CUDA_ARCH_LIST to build for all compute capabilities from Pascal through Turing.

In your case, if you want a single build that works for Titan Xp (compute capability 6.1), V100 (cc 7.0), and RTX2080Ti (cc 7.5), you can

$ pip uninstall apex # repeat if multiple installations occurred by accident
$ export TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5"
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
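To sanity-check that an arch list covers every GPU type in the cluster before rebuilding, a small helper (hypothetical, not part of apex or PyTorch) can parse the semicolon-separated list; the capability of the GPU you are currently on can be read with `torch.cuda.get_device_capability()`:

```python
def arch_list_covers(arch_list, cap):
    """Return True if a TORCH_CUDA_ARCH_LIST value includes compute capability `cap`."""
    archs = [a.strip().replace("+PTX", "") for a in arch_list.split(";")]
    return cap in archs

wanted = ["6.1", "7.0", "7.5"]  # TitanXp, V100, RTX2080ti
print(all(arch_list_covers("6.1;7.0;7.5", c) for c in wanted))  # True
```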

@mcarilli mcarilli closed this as completed Oct 7, 2019
@FabianIsensee
Author

Outstanding, thank you!

@MrRobot2211

Thank you.

@ethanjperez

@mcarilli It might be worth advertising this fact/fix in the README
