[build] build against installed cuda-11.1 while torch built w/ cuda-11.0 #570
Conversation
I learned this from nvidia apex: it works to build against installed cuda-11.1 while torch was built with cuda-11.0, as the API is similar (identical?).

Can probably remove this when cuda-11.2 comes out and we get pytorch supporting Ampere - until then pytorch can't be built with cuda-11.1.

I verified that I was able to build all options with:

```
DS_BUILD_OPS=1 pip install deepspeed -v .
```

```
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0.dev20201202+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7+7a75f8b, 7a75f8b, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.0
```

Otherwise with rtx-3090 I was getting:

`RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8`
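(For anyone reproducing this on a machine with several toolkits installed, a minimal sketch of pointing the build at a specific CUDA install. `CUDA_HOME` is the variable `torch.utils.cpp_extension` consults when locating the toolkit; the `/usr/local/cuda-11.1` path is an assumption - adjust to your layout.)

```
# Sketch only: force the extension build to use the cuda-11.1 toolkit.
# The path is an assumption - point it at wherever the toolkit lives.
CUDA_HOME=/usr/local/cuda-11.1 DS_BUILD_OPS=1 pip install deepspeed -v .
```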
Thanks for this @stas00. Would you also be able to run our unit tests in your environment? I don't readily have access to a cuda 11.1 machine (the most up-to-date I have is 11.0), and I don't have access to any rtx-3090s either.
I tried, but I'm getting lots of errors.

I aborted as it was too slow and it was already clear that something isn't right - but I have never run this test suite before, so perhaps the failures are unrelated. Your CI seems to run all of these fine. Here is the error log so far:
Ah yeah, I think this makes sense. PR #572 should fix this issue. Essentially, when pre-compiling our ops we weren't passing the compute capability flag for 8.0, which is needed to build the cuda/c++ code for that hardware. I think the unit tests should work if you instead re-install and use JIT-only compilation; JIT should pick up whatever compute capability is being used at runtime.
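(A sketch of the two modes being discussed - pre-compiling with the Ampere flag vs. deferring to JIT. This assumes `TORCH_CUDA_ARCH_LIST`, which `torch.utils.cpp_extension` honors, is the knob involved; it is not the author's exact fix from #572.)

```
# Sketch: pre-compile the ops with compute capability 8.0 included,
# so the binaries carry the flags for Ampere hardware.
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 pip install deepspeed -v .

# Or skip pre-compilation entirely; ops are then JIT-compiled at runtime
# for whatever compute capability the active GPU reports.
pip install deepspeed
```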
Merged #572, but it looks like your PR now has a merge conflict. Can you give it a try on your end after fixing the merge conflict with your change? It should be a small one, I think.
I built the binaries on your branch and tried one test - no change:

Will try JIT next.
JIT fails too. Also, weirdly, it skipped over the rtx-3090 card (0) and ran the test on another, older card (1) (same test as above). There is a very long output, ending with:
It wasn't set - both cards should have been visible.
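(Since the exchange above concerns which GPUs are visible, a minimal sketch of pinning a run to one card, assuming the rtx-3090 is device 0; the test path is illustrative, not the exact command used here.)

```
# Sketch: expose only the rtx-3090 so the test cannot silently fall
# back to the older card. Device index 0 is an assumption.
CUDA_VISIBLE_DEVICES=0 pytest tests/unit -x
```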
As I commented in the 2 previous comments above yours - it didn't help. Please let me know if you need any other setup info to diagnose these.
I just created a new issue for us to fix our unit tests so they are runnable on < 4 gpus. Unfortunately, right now the best way for you to test whether everything is set up right is a combination of …

I think the original PR here is fine, so I'll merge it now. If you run into anything else, don't hesitate to open an issue though.
I have 2 gpus in my current box and I'm building another box with 2 more older gpus, so I'm hoping to be able to test complex multi-node setups. I'm just impatiently waiting for cuda-11.2 to be released so that pytorch and friends fully support rtx-3090. It has been a month of pain since I got the card...
@stas00 @jeffra I came across this issue because I'm having a somewhat related problem. I was recently trying to scale up my training jobs to 30+ nodes, and I found that the …
It didn't really happen when I was training with fewer than 30 nodes. Any ideas on what's going on? Also, this stack trace was printed just 3 times in my log file when I was training with 30 nodes (8 GPUs each), i.e., 240 processes. I suspect this implies that the reset only happened on a few processes, not all of them. However, the actual training obviously didn't start, due to this RuntimeError on some processes. If this is unrelated to this issue (I just landed on it because I searched for …)
The PR itself was about a totally different thing. It's the tests that failed to run on my 2-gpu single-node setup, but that failure was unrelated to the PR itself.

The test failure does look similar to your issue, though: if you look at my report, process 1 reported a failure, and in response process 2 reset the connection. Is it possible that some process failed in your setup but somehow you didn't see the error reported? I saw at least one situation with deepspeed where it just …

But probably a separate issue would be a good way to proceed, linking to the failure I reported as related.
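(One hedged way to check whether a rank failed silently is to turn on NCCL's own logging. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; the training script name below is hypothetical.)

```
# Sketch: make every rank print NCCL init/transport details, so a rank
# that dies before its error surfaces still leaves a trace in the logs.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET deepspeed train.py
```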
I learned this from nvidia apex: it works to build against the installed cuda-11.1 while torch was built with cuda-11.0, as the API is similar (identical?). Note that tensorflow requires cuda-11.1 to work with rtx-30* cards, so while I do have both 11.0 and 11.1 installed, the builder can't find 11.0 automatically.
Can probably remove this when cuda-11.2 comes out and we get pytorch fully supporting Ampere - until then pytorch can't be built with cuda-11.1.
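(A quick sketch of confirming the mismatch being tolerated here: `torch.version.cuda` reports the CUDA version torch was built with, while `nvcc` reports the installed toolkit the extensions compile against. The nvcc path is an assumption.)

```
# Sketch: the two CUDA versions involved in this mixed build.
python -c "import torch; print(torch.version.cuda)"      # e.g. 11.0 (torch build)
/usr/local/cuda-11.1/bin/nvcc --version | grep release   # e.g. 11.1 (toolkit)
```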
I verified that I was able to build all options with `DS_BUILD_OPS=1 pip install deepspeed -v .`
Otherwise with rtx-3090 I was getting:
`RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8`