
[BUG] RoITransformer CUDA error: an illegal memory access was encountered #340

Closed
xiaoyihit opened this issue Jun 7, 2022 · 10 comments

@xiaoyihit

Describe the bug
Training RoI Transformer fails with "CUDA error: an illegal memory access was encountered" when running:
python tools/train.py 'configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py'

Environment

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.5.2
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
MMRotate: 0.3.0+

Error traceback
Traceback (most recent call last):
File "tools/train.py", line 294, in
main()
File "tools/train.py", line 288, in main
meta=meta)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/apis/train.py", line 156, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 238, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 155, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 126, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 197, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 133, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois_)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f20a94ffd62 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7f2100dee613 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f2100def022 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f20a94e9314 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7f217ee66dd9 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7f217f6b4f59 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7f217f6b5279 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7f21ba3cdbf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@zytx121
Collaborator

zytx121 commented Jun 8, 2022

Could you please try:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py

and update your error report here.
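
For reference, CUDA_LAUNCH_BLOCKING=1 forces kernel launches to run synchronously, so the Python traceback points at the op that actually faulted rather than at a later allocator call. A minimal sketch (assuming a standard PyTorch setup, not part of the original reply) of setting it from inside a script instead of on the command line:

import os
# Must be set before the first CUDA context is created, i.e. before any
# tensor is moved to the GPU.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
x = torch.randn(4, device='cuda')  # any CUDA error now surfaces at the faulting call
print(x.sum())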

@xiaoyihit
Author

Could you please try:

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py

and update your error report here.

Although I DID add CUDA_LAUNCH_BLOCKING=1 in the first place, I tried the command you gave, and the output is below.
I didn't find any difference.

Traceback (most recent call last):
File "tools/train.py", line 197, in
main()
File "tools/train.py", line 193, in main
meta=meta)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 241, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 158, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 129, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.0/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 137, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois_)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fef61fb8d62 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7fefb98a7613 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fefb98a8022 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fef61fa2314 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7ff03791fdd9 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7ff03816df59 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7ff03816e279 in /opt/conda/envs/mmrotatev3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7ff072e48bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@zytx121
Collaborator

zytx121 commented Jun 9, 2022

What command did you use to install mmcv-full?

@xiaoyihit
Author

What command did you use to install mmcv-full?

pip install mmcv-full

@zytx121
Collaborator

zytx121 commented Jun 13, 2022

Please uninstall it first and use mim to install mmcv-full:

pip install -U openmim
mim install mmcv-full

Or you can specify the version of mmcv-full yourself:

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.10.0/index.html
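
Whichever install route is used, a quick sanity check (a sketch, not part of the original suggestion) is to compare the CUDA version mmcv-full was compiled against with the local PyTorch build; in the environment reported above, both should say 11.1:

import torch
import mmcv
from mmcv.ops import get_compiler_version, get_compiling_cuda_version

# The compiled mmcv CUDA ops and the local PyTorch build should agree on the
# CUDA version; a mismatch is a common cause of illegal memory access errors.
print('mmcv:         ', mmcv.__version__)
print('mmcv compiler:', get_compiler_version())
print('mmcv CUDA:    ', get_compiling_cuda_version())
print('torch:        ', torch.__version__)
print('torch CUDA:   ', torch.version.cuda)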

@xiaoyihit
Author

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py

Successfully installed mmcv-full 1.5.2, but the problem remains unsolved.

Traceback (most recent call last):
File "tools/train.py", line 192, in
main()
File "tools/train.py", line 188, in main
meta=meta)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 241, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 158, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 129, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 198, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/github/mmrotate-0.3.1/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 137, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois
)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff7bb4c7d62 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7ff812db6613 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7ff812db7022 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7ff7bb4b1314 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7ff890e2edd9 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7ff89167cf59 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7ff89167d279 in /opt/conda/envs/mmrotatev0.3.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7ff8cc357bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@zytx121
Collaborator

zytx121 commented Jun 20, 2022

What command did you use to install mmcv-full? mmcv-full 1.5.2 has many precompiled builds; it looks like you installed the wrong one.

@xiaoyihit
Author

What command did you use to install mmcv-full? mmcv-full 1.5.2 has many precompiled builds; it looks like you installed the wrong one.

pip install openmim
mim install mmcv-full

I will try specifying the version of mmcv-full myself. After that, I will update my results here.

@xiaoyihit
Author

xiaoyihit commented Jun 20, 2022

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.10.0/index.html

Installed mmcv-full 1.5.3 with the command above.
Ran python tools/train.py 'configs/roi_trans/roi_trans_r50_fpn_1x_dota_le90.py'
The first time, training ran fine.
Partial log:
2022-06-20 10:40:50,538 - mmrotate - INFO - Exp name: roi_trans_r50_fpn_1x_dota_le90.py
2022-06-20 10:40:50,539 - mmrotate - INFO - Epoch [1][1000/6400] lr: 5.000e-03, eta: 11:12:47, time: 0.517, data_time: 0.141, memory: 8646, loss_rpn_cls: 0.1138, loss_rpn_bbox: 0.0643, s0.loss_cls: 0.3465, s0.acc: 89.5312, s0.loss_bbox: 0.2405, s1.loss_cls: 0.2302, s1.acc: 93.2569, s1.loss_bbox: 0.3186, loss: 1.3138, grad_norm: 5.2039
2022-06-20 10:41:16,840 - mmrotate - INFO - Epoch [1][1050/6400] lr: 5.000e-03, eta: 11:11:57, time: 0.526, data_time: 0.142, memory: 8646, loss_rpn_cls: 0.0919, loss_rpn_bbox: 0.0588, s0.loss_cls: 0.3602, s0.acc: 89.0137, s0.loss_bbox: 0.3129, s1.loss_cls: 0.2029, s1.acc: 94.2634, s1.loss_bbox: 0.3078, loss: 1.3345, grad_norm: 5.4933
2022-06-20 10:41:43,955 - mmrotate - INFO - Epoch [1][1100/6400] lr: 5.000e-03, eta: 11:12:05, time: 0.542, data_time: 0.140, memory: 8646, loss_rpn_cls: 0.1178, loss_rpn_bbox: 0.0514, s0.loss_cls: 0.3247, s0.acc: 90.5137, s0.loss_bbox: 0.2465, s1.loss_cls: 0.2252, s1.acc: 93.7893, s1.loss_bbox: 0.3383, loss: 1.3040, grad_norm: 5.4267
2022-06-20 10:42:11,028 - mmrotate - INFO - Epoch [1][1150/6400] lr: 5.000e-03, eta: 11:12:07, time: 0.541, data_time: 0.141, memory: 8646, loss_rpn_cls: 0.0917, loss_rpn_bbox: 0.0559, s0.loss_cls: 0.3154, s0.acc: 90.3906, s0.loss_bbox: 0.2808, s1.loss_cls: 0.1796, s1.acc: 94.5897, s1.loss_bbox: 0.2755, loss: 1.1989, grad_norm: 5.0490
2022-06-20 10:42:37,217 - mmrotate - INFO - Epoch [1][1200/6400] lr: 5.000e-03, eta: 11:11:11, time: 0.524, data_time: 0.145, memory: 8646, loss_rpn_cls: 0.0983, loss_rpn_bbox: 0.0534, s0.loss_cls: 0.3391, s0.acc: 89.9336, s0.loss_bbox: 0.2663, s1.loss_cls: 0.2060, s1.acc: 94.0641, s1.loss_bbox: 0.2539, loss: 1.2170, grad_norm: 5.1556
2022-06-20 10:43:03,545 - mmrotate - INFO - Epoch [1][1250/6400] lr: 5.000e-03, eta: 11:10:26, time: 0.527, data_time: 0.140, memory: 8646, loss_rpn_cls: 0.0974, loss_rpn_bbox: 0.0611, s0.loss_cls: 0.3693, s0.acc: 89.2676, s0.loss_bbox: 0.2676, s1.loss_cls: 0.2668, s1.acc: 92.5827, s1.loss_bbox: 0.3967, loss: 1.4588, grad_norm: 5.7305
2022-06-20 10:43:30,119 - mmrotate - INFO - Epoch [1][1300/6400] lr: 5.000e-03, eta: 11:09:57, time: 0.531, data_time: 0.156, memory: 8646, loss_rpn_cls: 0.0830, loss_rpn_bbox: 0.0542, s0.loss_cls: 0.3160, s0.acc: 89.9883, s0.loss_bbox: 0.2747, s1.loss_cls: 0.2058, s1.acc: 93.9955, s1.loss_bbox: 0.3153, loss: 1.2489, grad_norm: 5.0566
2022-06-20 10:43:57,989 - mmrotate - INFO - Epoch [1][1350/6400] lr: 5.000e-03, eta: 11:10:40, time: 0.557, data_time: 0.149, memory: 8646, loss_rpn_cls: 0.0893, loss_rpn_bbox: 0.0552, s0.loss_cls: 0.3041, s0.acc: 90.4238, s0.loss_bbox: 0.2686, s1.loss_cls: 0.1922, s1.acc: 94.1526, s1.loss_bbox: 0.2801, loss: 1.1895, grad_norm: 4.8888

The second time, the same bug occurred.
Traceback (most recent call last):
File "tools/train.py", line 326, in
main()
File "tools/train.py", line 320, in main
meta=meta)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/apis/train.py", line 156, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
**kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 238, in forward_train
rcnn_train_cfg)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 155, in _bbox_forward_train
bbox_results = self._bbox_forward(stage, x, rois)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_trans_roi_head.py", line 126, in _bbox_forward
rois)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
return old_func(*args, **kwargs)
File "/remote-home/xiaoyi/mmrotate-main/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 133, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois
)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 171, in forward
self.clockwise)
File "/opt/conda/envs/mmrotate/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 70, in forward
clockwise=ctx.clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1634272178570/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7cd9c2bd62 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c613 (0x7f7d3151a613 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f7d3151b022 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f7cd9c15314 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x294dd9 (0x7f7daf592dd9 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xae2f59 (0x7f7dafde0f59 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7f7dafde1279 in /opt/conda/envs/mmrotate/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xe7 (0x7f7deaaf9bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Uninstalled mmcv-full with
pip uninstall mmcv-full
and retried. The bug remains.

I literally do not understand...
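
One way to narrow this down further (a sketch under assumptions, not something suggested in the thread) is to call the op from the traceback, mmcv.ops.RoIAlignRotated, directly on dummy data. If this minimal case also hits an illegal memory access, the compiled op or the environment is at fault; if it runs cleanly, the crash more likely comes from degenerate RoIs produced during training.

import torch
from mmcv.ops import RoIAlignRotated

# One dummy FPN-style feature map and a single rotated RoI.
# RoI layout assumed as (batch_idx, cx, cy, w, h, angle_in_radians);
# keyword names follow mmcv 1.x and may differ in other versions.
feat = torch.randn(1, 256, 64, 64, device='cuda')
rois = torch.tensor([[0., 32., 32., 16., 8., 0.5]], device='cuda')

layer = RoIAlignRotated(7, spatial_scale=1.0, sampling_ratio=2)
out = layer(feat, rois)
torch.cuda.synchronize()  # surface any asynchronous CUDA error here
print(out.shape)          # expected: torch.Size([1, 256, 7, 7])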

@xiaoyihit
Author

xiaoyihit commented Oct 11, 2022 via email
