multi-GPU training throws an illegal memory access #32
I got the same error. The difference is that when I use one GPU or two GPUs, there is no problem. But when using 4 GPUs to train Mask RCNN (mask_rcnn_R-101-FPN) or RetinaNet (retinanet_R-101-FPN), the same problem occurs. |
I have the same problem when I train the tutorial_Res50 network with two or more GPUs. |
Encountered same issue when specifying GPU ids (i.e. different from lowest ids, e.g. '1,3,5,7' for 4 GPUs). If lowest GPU ids are specified, training goes on fine. |
@jwnsu: we're working on a fix so that when |
Hi @jwnsu, @coolbrain, @tshizys, @lwher: we are unable to reproduce this issue on our side. Can you each provide some more information that might reveal a common pattern? In particular:
Here's what we see when training, for example, with GPU ids 1,3,5,7:
|
Operating system: Ubuntu 16.04 nvidia-smi: |
Operating system: CentOS Linux release 7.1.1503 When using 4 GPUs (0,1,2,3) to train Mask RCNN (e2e_mask_rcnn_R-101-FPN), RetinaNet (retinanet_R-101-FPN) or Faster RCNN (e2e_faster_rcnn_R-50-FPN), the error “context_gpu.h:307: an illegal memory access was encountered” or “context_gpu.h:170. Encountered CUDA error: an illegal memory access was encountered Error from operator: input: "gpu_0/retnet_cls_pred_fpn3_b_grad" input: "gpu_2/retnet_cls_pred_fpn3_b_grad" output: "gpu_0/retnet_cls_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 } ” occurs. But when using one GPU or two GPUs (0,1 or 2,3), training runs normally. |
@jwnsu: looking at your error more closely ("invalid device ordinal"), it looks like you're trying to train with a config set up for 8 GPUs but restricting the process to have only access to 4 (via |
@coolbrain, @tshizys: thanks for the details. What happens if you use two GPUs using ids {0,2}, {0,3}, {1,2}, or {1,3}? |
@rbgirshick you are right, I picked the wrong config file (with an 8-GPU setting) to try yesterday. Just tried again with the right config file (4 GPUs; the error occurs with GPU ids "1,2,4,5", while "0,1,2,3" works fine), and the error is now similar to what others are seeing:
|
@coolbrain, @tshizys: one shot in the dark is to switch the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py.
This will require Caffe2 to have been built with nccl ops -- I'm not sure if this is done by default or will require some work to rebuild Caffe2 with nccl support. |
@rbgirshick, when using two GPUs, i.e. {0,2}, {0,3}, {1,2}, or {1,3}, the error still exists. Here are the details, using {0,3} and training RetinaNet (retinanet_R-101-FPN) as an example:
F0128 12:09:08.461153 4938 context_gpu.cu:387] Error at: /home/yszhu/local/caffe2/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
The exact form of the error is not the same each time, but it is always "Encountered CUDA error: an illegal memory access was encountered". |
I also rebuilt caffe2 with nccl-1.3.5 (following https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud#null__troubleshooting) and switched the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py, as in: python2 tools/train_net.py --multi-gpu-testing The error disappeared ^--^ both when using four GPUs {0,1,2,3} and with any pair of GPUs {0,2}, {0,3}, {1,2}, {1,3}. |
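For reference, the full command follows the standard Detectron train_net.py pattern; the sketch below is illustrative (the config path and output directory are placeholders), and only --multi-gpu-testing plus the trailing USE_NCCL True override come from this thread:
python2 tools/train_net.py \
    --multi-gpu-testing \
    --cfg configs/your_config.yaml \
    OUTPUT_DIR /tmp/detectron-output \
    USE_NCCL True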
Hi, I enabled the nccl op to train the tutorial_network and the error above disappeared. However, the program hangs after loading data and occupies 100% CPU all the time. ....... the program hangs ...... my environment: nvidia-smi: |
@lwher: that's unfortunate. The reason we don't use NCCL by default is that it's prone to causing deadlocks, which is what I think you're seeing. |
After rebuilding caffe2 with NCCL, I reran the program with this script: It throws this error:
Creating NCCLContext for key: 0:0,1,2,3,
You should always run with libnvidia-ml.so that is installed with your
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
You should always run with libnvidia-ml.so that is installed with your
Running Environment: nvidia-smi: |
One additional note about NCCL: Caffe2 builds with NCCL by default so there is no need to rebuild it. |
Jumping onto this: since the illegal memory access is from the Add operator, you might want to check if direct peer access is available between the gpus that you are using. The current Add op relies on that, and if it is not available we might indeed want to fix the code. Basically, to do so, in python, do:
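The check being referred to is the two-line snippet whose output is pasted in later replies:
>>> from caffe2.python import workspace
>>> print(workspace.GetCudaPeerAccessPattern())
A True entry at row i, column j means GPU i can directly access GPU j's memory; the muji Add path assumes this is available.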
Could you paste the output of that for debugging? (Especially, if you are using CUDA_VISIBLE_DEVICES, make sure you invoke python with that too) |
@Yangqing output from your two debug lines:
thx for looking into this issue (and ... caffe/caffe2 frameworks!) |
@jwnsu thanks! Just to confirm, so the Add operator is adding tensors across gpu {0,1} and {2,3} right? (I assume it is adding stuff together from the 4 gpus). |
It's a 4-GPU config, with GPU ids specified as "0,1,2,4" (via CUDA_VISIBLE_DEVICES). If GPU ids are configured as "0,1,2,3" (the lowest GPU ids), it works fine without any error. |
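A hedged illustration of why the id set matters (the workspace helpers below exist in caffe2, but treat the session as a sketch): CUDA_VISIBLE_DEVICES renumbers devices inside the process, so "0,1,2,4" is seen as logical GPUs 0-3, and the peer-access topology of those physical GPUs can differ from that of "0,1,2,3":
$ CUDA_VISIBLE_DEVICES=0,1,2,4 python2
>>> from caffe2.python import workspace
>>> workspace.NumCudaDevices()             # the process sees 4 logical devices
4
>>> workspace.GetCudaPeerAccessPattern()   # topology of the visible GPUs only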
@Yangqing I can train the net using 1 GPU without problems, but when I train using 2 or 4 GPUs, I run into the same problems as above, even if I set NCCL = True. |
Thanks guys. This verifies my assumption that the illegal memory access comes from the Add op not properly handling cross-device communications when peer access is not enabled. Will issue a fix. |
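The fix that eventually landed (daquexian's muji patch, discussed below) follows this idea; here is a minimal sketch under the usual caffe2 net-building API, with illustrative names and not the merged code: copy each peer GPU's gradient onto one GPU with Copy ops (a device-to-device memcpy, which needs no peer access), sum there, then copy the result back, so no kernel ever dereferences another device's pointer.
from caffe2.python import core
from caffe2.proto import caffe2_pb2

def allreduce_without_peer_access(net, blobs, gpu_ids):
    # blobs[i] lives on gpu_ids[i]; afterwards every blob holds the sum.
    master = core.DeviceOption(caffe2_pb2.CUDA, gpu_ids[0])
    # Stage copies of the peer blobs onto the first GPU; a memcpy can cross
    # devices without peer access, unlike a kernel reading a remote pointer.
    staged = [blobs[0]]
    for blob in blobs[1:]:
        staged.append(net.Copy(blob, str(blob) + '_on_gpu%d' % gpu_ids[0],
                               device_option=master))
    # Sum in place on the first GPU: every input now lives on that device.
    reduced = net.Sum(staged, blobs[0], device_option=master)
    # Broadcast the reduced gradient back to the other GPUs.
    for blob, gpu in zip(blobs[1:], gpu_ids[1:]):
        net.Copy(reduced, blob,
                 device_option=core.DeviceOption(caffe2_pb2.CUDA, gpu))
    return reduced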
Can anybody tell me whether I can run Mask R-CNN with only one GPU? |
@daquexian I tried your PR, it works!!! Thanks very much |
@daquexian This PR doesn't appear to work for me. I'm experiencing deadlocks while using a single GPU without NCCL and also while using 2 GPUs with USE_NCCL True. After changing muji.py according to your PR and running with 2 GPUs with USE_NCCL True, I'm still experiencing a deadlock; the training just pauses at random iteration numbers. |
Thanks for trying :) You don't need to set USE_NCCL=True if you use my PR. NCCL and "muji" are two different GPU communication methods. My PR is a patch for muji, which previously required GPU peer access; it does not affect NCCL. Just set USE_NCCL=False and my PR will work.
…On Wed, May 2, 2018, 2:51 AM Thomas Balestri ***@***.***> wrote:
@daquexian <https://github.com/daquexian> This PR doesn't appear to work
for me. I'm experiencing deadlocks while using a single GPU without NCCL
and also while using 2 GPUs with USE_NCCL True. After changing muji.py
according to your PR and running with 2 GPUs with USE_NCCL True, I'm
still experiencing a deadlock; the training just pauses at random iteration
numbers.
|
Maybe I'm missing something, but if I set USE_NCCL=False and use your modified muji.py and muji_test.py from the PR, I get the original error:
I0502 14:35:57.192476 79712 context_gpu.cu:318] Total: 23025 MB
E0502 14:35:58.382604 79711 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
E0502 14:35:58.382622 79712 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 14:35:58.382670 79711 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 14:35:58.383510 79709 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator:
input: "gpu_1/fpn_res3_3_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m18_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_1/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383541 79713 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at conv_op_cudnn.cc:1290] status == CUDNN_STATUS_SUCCESS. 8 vs 0. , Error at: /home/markable-ai/pytorch/caffe2/operators/conv_op_cudnn.cc:1290: CUDNN_STATUS_EXECUTION_FAILED Error from operator:
input: "gpu_3/conv_rpn_fpn4" input: "gpu_3/rpn_bbox_pred_fpn2_w" input: "gpu_3/rpn_bbox_pred_fpn4_grad" output: "_gpu_3/rpn_bbox_pred_fpn2_w_grad_autosplit_1" output: "_gpu_3/rpn_bbox_pred_fpn2_b_grad_autosplit_1" output: "gpu_3/__m13_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383591 79706 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator:
input: "gpu_3/conv_rpn_fpn3" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn3_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_2" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_2" output: "_gpu_3/conv_rpn_fpn3_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 14:35:58.434631 79709 context_gpu.h:107] FCheck failed: error == cudaSuccess an illegal memory access was encountered0502 14:35:58.434648 79713 c*** Check failure stack trace: ***
E0502 14:35:58.383741 79700 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator:
input: "gpu_3/conv_rpn_fpn2" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn2_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_3" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_3" output: "_gpu_3/conv_rpn_fpn2_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
Aborted (core dumped)
I'm using CUDA 9.1 and cuDNN 7.1 with 4 V100s. |
@Feynman27 Could you tell me which branch(like |
>>> from caffe2.python import workspace
>>> print(workspace.GetCudaPeerAccessPattern())
[[ True False False False]
[False True False False]
[False False True False]
[False False False True]]
I'll try calling |
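A diagonal-only pattern like this means no GPU pair has direct peer access, which is exactly the case the muji Add path did not handle. A quick programmatic check might look like the following sketch (plain numpy over the array returned above):
>>> import numpy as np
>>> from caffe2.python import workspace
>>> pattern = workspace.GetCudaPeerAccessPattern()
>>> off_diag = pattern[~np.eye(len(pattern), dtype=bool)]
>>> bool(off_diag.any())   # False here: no cross-GPU peer access, so prefer NCCL or the muji fallback
False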
Calling I0502 17:08:51.294476 88651 context_gpu.cu:318] Total: 22524 MB
E0502 17:08:52.009866 88659 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 17:08:52.009990 88659 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 17:08:52.010440 88651 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator:
input: "gpu_2/fpn_res3_3_sum" input: "gpu_2/conv_rpn_fpn2_w" input: "gpu_2/__m15_shared" output: "_gpu_2/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_2/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_2/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
E0502 17:08:52.010524 88663 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator:
input: "gpu_1/fpn_res2_2_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m12_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_3" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_3" output: "_gpu_1/fpn_res2_2_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
E0502 17:08:52.010577 88653 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator:
input: "gpu_0/fpn_res4_22_sum" input: "gpu_0/conv_rpn_fpn2_w" input: "gpu_0/__m15_shared" output: "_gpu_0/conv_rpn_fpn2_w_grad_autosplit_1" output: "_gpu_0/conv_rpn_fpn2_b_grad_autosplit_1" output: "_gpu_0/fpn_res4_22_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 17:08:52.061749 88653 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
Aborted (core dumped) |
@Feynman27 It's strange. According to your gpu access pattern, |
@Feynman27 did you rebuild caffe2? |
@daquexian The caffe2 package is installed under Python 2.7.14 |Anaconda, Inc.| (default, Mar 27 2018, 17:29:31)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import caffe2
>>> caffe2.__file__
'/home/markable-ai/pytorch/build/caffe2/__init__.pyc'
>>> from caffe2.python import muji
>>> muji.__file__
'/home/markable-ai/pytorch/build/caffe2/python/muji.pyc'
>>>
I simply modified the muji.py file. @yuzcccc I didn't rebuild caffe2, but why would I have to? I'm only modifying a python file. |
@Feynman27 I think you should modify |
Yep, that was my oversight. Good catch. I was modifying |
@Feynman27 I'm happy to see it working :) |
@daquexian Unfortunately, I still seem to be experiencing deadlocks. |
@Feynman27 Hmm.. What is the value of |
Yes, |
@Feynman27 Sorry, I have no idea why it would cause a deadlock. It's hard for me to reproduce. |
Fair enough. For all I know, the deadlock I'm experiencing could be unrelated to whether or not GPU peer access is enabled. Your PR definitely allowed me to start training with |
@daquexian Thanks! Your PR worked for me! |
Looks like this issue can be closed. |
@gadcam thanks for helping to identify issues that can be closed! For this one, I'd like to leave it open until there's a fix merged into Caffe2. |
@rbgirshick Unfortunately no one reviews my PR :| |
@rbgirshick Thanks! My PR pytorch/pytorch#6896 has been merged. It looks like this issue can be closed :) |
When I use one GPU to train, there is no problem. But when I use two or four GPUs, the problem appears. The log output:
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1516866180 (unix time) try "date -d @1516866180" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
PC: @ 0x7ff67559f428 gsignal
terminate called recursively
terminate called recursively
E0125 07:43:00.745853 55683 pybind_state.h:422] Exception encountered running PythonOp function: RuntimeError: [enforce fail at context_gpu.h:307] error == cudaSuccess. 77 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/core/context_gpu.h:307: an illegal memory access was encountered
At:
/mnt/hzhida/facebook/detectron/lib/ops/generate_proposals.py(101): forward
*** SIGABRT (@0x3e80000d84f) received by PID 55375 (TID 0x7ff453fff700) from PID 55375; stack trace: ***
terminate called recursively
@ 0x7ff675945390 (unknown)
@ 0x7ff67559f428 gsignal
@ 0x7ff6755a102a abort
@ 0x7ff66f37e84d __gnu_cxx::__verbose_terminate_handler()
@ 0x7ff66f37c6b6 (unknown)
@ 0x7ff66f37c701 std::terminate()
@ 0x7ff66f3a7d38 (unknown)
@ 0x7ff67593b6ba start_thread
@ 0x7ff67567141d clone
@ 0x0 (unknown)
Aborted (core dumped)