
[BUG] RoITransformer CUDA error: an illegal memory access was encountered #340

Closed
xiaoyihit opened this issue Jun 7, 2022 · 10 comments

@xiaoyihit

Describe the bug
Training RoI Transformer fails with "CUDA error: an illegal memory access was encountered" when running:
python tools/train.py 'configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py'

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.5.2
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
MMRotate: 0.3.0+

Error traceback
Traceback (most recent call last):
File "tools/train.py", line 294, in
main()
File "tools/train.py", line 288, in main
meta=meta)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/apis/train.py", line 156, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 238, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 155, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 126, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 197, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 133, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois_)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f20a94ffd62 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7f2100dee613 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f2100def022 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f20a94e9314 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7f217ee66dd9 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7f217f6b4f59 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7f217f6b5279 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7f21ba3cdbf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@zytx121
Collaborator

zytx121 commented Jun 8, 2022

Could you please try:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py

and update your error report here.
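
For reference, CUDA_LAUNCH_BLOCKING=1 forces kernel launches to run synchronously, so the Python traceback points at the op that actually faulted rather than at a later allocator call. A minimal sketch (assuming a standard PyTorch setup, not part of the original reply) of setting it from inside a script instead of on the command line:

import os
# Must be set before the first CUDA context is created, i.e. before any
# tensor is moved to the GPU.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
x = torch.randn(4, device='cuda')  # any CUDA error now surfaces at the faulting call
print(x.sum())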

@xiaoyihit
Author

Could you please try:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py

and update your error report here.

Although I DID add CUDA_LAUNCH_BLOCKING=1 in the first place, I tried the command you gave, and the output is below.
I didn't find any difference.

Traceback (most recent call last):
File "tools/train.py", line 197, in
main()
File "tools/train.py", line 193, in main
meta=meta)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 241, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 158, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 129, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 137, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois_)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fef61fb8d62 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7fefb98a7613 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fefb98a8022 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fef61fa2314 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7ff03791fdd9 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7ff03816df59 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7ff03816e279 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7ff072e48bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@zytx121
Collaborator

zytx121 commented Jun 9, 2022

What command did you use to install mmcv-full?

@xiaoyihit
Author

What command did you use to install mmcv-full?

pip install mmcv-full

@zytx121
Collaborator

zytx121 commented Jun 13, 2022

Please uninstall it first and use mim to install mmcv-full:

pip install -U openmim
mim install mmcv-full

Or you can specify the version of mmcv-full yourself:

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.10.0/index.html
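
Whichever install route is used, a quick sanity check (a sketch, not part of the original suggestion) is to compare the CUDA version mmcv-full was compiled against with the local PyTorch build; in the environment reported above, both should say 11.1:

import torch
import mmcv
from mmcv.ops import get_compiler_version, get_compiling_cuda_version

# The compiled mmcv CUDA ops and the local PyTorch build should agree on the
# CUDA version; a mismatch is a common cause of illegal memory access errors.
print('mmcv:         ', mmcv.__version__)
print('mmcv compiler:', get_compiler_version())
print('mmcv CUDA:    ', get_compiling_cuda_version())
print('torch:        ', torch.__version__)
print('torch CUDA:   ', torch.version.cuda)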

@xiaoyihit
Author

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py

Successfully installed mmcv-full 1.5.2, but the problem remains unsolved.

Traceback (most recent call last):
File "tools/train.py", line 192, in
main()
File "tools/train.py", line 188, in main
meta=meta)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 241, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 158, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 129, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 137, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois
)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff7bb4c7d62 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7ff812db6613 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7ff812db7022 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7ff7bb4b1314 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7ff890e2edd9 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7ff89167cf59 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7ff89167d279 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7ff8cc357bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@zytx121
Collaborator

zytx121 commented Jun 20, 2022

What command did you use to install mmcv-full? mmcv-full 1.5.2 has many precompiled builds; it looks like you installed the wrong one.

@xiaoyihit
Author

What command did you use to install mmcv-full? mmcv-full 1.5.2 has many precompiled builds; it looks like you installed the wrong one.

pip install openmim
mim install mmcv-full

I will try specifying the version of mmcv-full myself. After that, I will update my results here.

@xiaoyihit
Author

xiaoyihit commented Jun 20, 2022

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.10.0/index.html

Installed mmcv-full 1.5.3 with the command above.
Ran python tools/train.py 'configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py'
The first time, training ran fine.
Partial log:
2022-06-20 10:40:50,538 - mmrotate - INFO - Exp name: roi_trans_r50_fpn_1x_dota_le90.py
2022-06-20 10:40:50,539 - mmrotate - INFO - Epoch [1][1000/6400] lr: 5.000e-03, eta: 11:12:47, time: 0.517, data_time: 0.141, memory: 8646, loss_rpn_cls: 0.1138, loss_rpn_bbox: 0.0643, s0.loss_cls: 0.3465, s0.acc: 89.5312, s0.loss_bbox: 0.2405, s1.loss_cls: 0.2302, s1.acc: 93.2569, s1.loss_bbox: 0.3186, loss: 1.3138, grad_norm: 5.2039
2022-06-20 10:41:16,840 - mmrotate - INFO - Epoch [1][1050/6400] lr: 5.000e-03, eta: 11:11:57, time: 0.526, data_time: 0.142, memory: 8646, loss_rpn_cls: 0.0919, loss_rpn_bbox: 0.0588, s0.loss_cls: 0.3602, s0.acc: 89.0137, s0.loss_bbox: 0.3129, s1.loss_cls: 0.2029, s1.acc: 94.2634, s1.loss_bbox: 0.3078, loss: 1.3345, grad_norm: 5.4933
2022-06-20 10:41:43,955 - mmrotate - INFO - Epoch [1][1100/6400] lr: 5.000e-03, eta: 11:12:05, time: 0.542, data_time: 0.140, memory: 8646, loss_rpn_cls: 0.1178, loss_rpn_bbox: 0.0514, s0.loss_cls: 0.3247, s0.acc: 90.5137, s0.loss_bbox: 0.2465, s1.loss_cls: 0.2252, s1.acc: 93.7893, s1.loss_bbox: 0.3383, loss: 1.3040, grad_norm: 5.4267
2022-06-20 10:42:11,028 - mmrotate - INFO - Epoch [1][1150/6400] lr: 5.000e-03, eta: 11:12:07, time: 0.541, data_time: 0.141, memory: 8646, loss_rpn_cls: 0.0917, loss_rpn_bbox: 0.0559, s0.loss_cls: 0.3154, s0.acc: 90.3906, s0.loss_bbox: 0.2808, s1.loss_cls: 0.1796, s1.acc: 94.5897, s1.loss_bbox: 0.2755, loss: 1.1989, grad_norm: 5.0490
2022-06-20 10:42:37,217 - mmrotate - INFO - Epoch [1][1200/6400] lr: 5.000e-03, eta: 11:11:11, time: 0.524, data_time: 0.145, memory: 8646, loss_rpn_cls: 0.0983, loss_rpn_bbox: 0.0534, s0.loss_cls: 0.3391, s0.acc: 89.9336, s0.loss_bbox: 0.2663, s1.loss_cls: 0.2060, s1.acc: 94.0641, s1.loss_bbox: 0.2539, loss: 1.2170, grad_norm: 5.1556
2022-06-20 10:43:03,545 - mmrotate - INFO - Epoch [1][1250/6400] lr: 5.000e-03, eta: 11:10:26, time: 0.527, data_time: 0.140, memory: 8646, loss_rpn_cls: 0.0974, loss_rpn_bbox: 0.0611, s0.loss_cls: 0.3693, s0.acc: 89.2676, s0.loss_bbox: 0.2676, s1.loss_cls: 0.2668, s1.acc: 92.5827, s1.loss_bbox: 0.3967, loss: 1.4588, grad_norm: 5.7305
2022-06-20 10:43:30,119 - mmrotate - INFO - Epoch [1][1300/6400] lr: 5.000e-03, eta: 11:09:57, time: 0.531, data_time: 0.156, memory: 8646, loss_rpn_cls: 0.0830, loss_rpn_bbox: 0.0542, s0.loss_cls: 0.3160, s0.acc: 89.9883, s0.loss_bbox: 0.2747, s1.loss_cls: 0.2058, s1.acc: 93.9955, s1.loss_bbox: 0.3153, loss: 1.2489, grad_norm: 5.0566
2022-06-20 10:43:57,989 - mmrotate - INFO - Epoch [1][1350/6400] lr: 5.000e-03, eta: 11:10:40, time: 0.557, data_time: 0.149, memory: 8646, loss_rpn_cls: 0.0893, loss_rpn_bbox: 0.0552, s0.loss_cls: 0.3041, s0.acc: 90.4238, s0.loss_bbox: 0.2686, s1.loss_cls: 0.1922, s1.acc: 94.1526, s1.loss_bbox: 0.2801, loss: 1.1895, grad_norm: 4.8888

The second time, the same bug occurred.
Traceback (most recent call last):
File "tools/train.py", line 326, in
main()
File "tools/train.py", line 320, in main
meta=meta)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/apis/train.py", line 156, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 238, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 155, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 126, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 133, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois
)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7cd9c2bd62 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7f7d3151a613 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f7d3151b022 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f7cd9c15314 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7f7daf592dd9 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7f7dafde0f59 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7f7dafde1279 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7f7deaaf9bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Uninstalled mmcv-full with
pip uninstall mmcv-full
and retried. The bug remains.

I literally do not understand...
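
One way to narrow this down further (a sketch under assumptions, not something suggested in the thread) is to call the op from the traceback, mmcv.ops.RoIAlignRotated, directly on dummy data. If this minimal case also hits an illegal memory access, the compiled op or the environment is at fault; if it runs cleanly, the crash more likely comes from degenerate RoIs produced during training.

import torch
from mmcv.ops import RoIAlignRotated

# One dummy FPN-style feature map and a single rotated RoI.
# RoI layout assumed as (batch_idx, cx, cy, w, h, angle_in_radians);
# keyword names follow mmcv 1.x and may differ in other versions.
feat = torch.randn(1, 256, 64, 64, device='cuda')
rois = torch.tensor([[0., 32., 32., 16., 8., 0.5]], device='cuda')

layer = RoIAlignRotated(7, spatial_scale=1.0, sampling_ratio=2)
out = layer(feat, rois)
torch.cuda.synchronize()  # surface any asynchronous CUDA error here
print(out.shape)          # expected: torch.Size([1, 256, 7, 7])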

@xiaoyihit
Author

xiaoyihit commented Oct 11, 2022 via email
