CUDA error: an illegal memory access was encountered #405

Closed
shnew opened this issue Jul 11, 2022 · 17 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@shnew commented Jul 11, 2022

First of all, thank you very much for your work.
However, testing rotated_reppoints always fails with the following error:
File "/data2/S/RepPoints_oriented/mmrotate-0.2.0/mmrotate/models/dense_heads/rotated_reppoints_head.py", line 1157, in _get_bboxes_single
scale_factor)
RuntimeError: CUDA error: an illegal memory access was encountered

How can this be solved? I tried different mmrotate versions and the solutions mentioned in other issues, but none of them worked. I would really appreciate your help. Thank you!

@yangxue0827 (Collaborator)

Please run python mmrotate/utils/collect_env.py to collect necessary environment information and paste it here.
@LiWentomng Any suggestions?

@shnew (Author) commented Jul 11, 2022

fatal: Not a git repository (or any parent up to mount point /data2)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla V100-PCIE-32GB
GPU 1,2,3,4,5,6,7,8,9: GeForce RTX 2080 Ti
CUDA_HOME: /data1/shenhui/cuda-10.2:/data1/s/software
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.8.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0
OpenCV: 4.6.0
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMRotate: 0.2.0+

@LiWentomng (Contributor)

Please try the suggestions below:
(1) Change the test image size from (1024, 1024) to (960, 960) in the test config (a config sketch follows at the end of these suggestions). This can work, though it decreases performance slightly.
(2) Change the code in the function _get_bboxes_single from

            poly_pred = self.points2rotrect(points_pred, y_first=True)
            bbox_pos_center = points[:, :2].repeat(1, 4)
            polys = poly_pred * self.point_strides[level_idx] + bbox_pos_center
            bboxes = poly2obb(polys, self.version)

to

            pts_pred = points_pred.reshape(-1, self.num_points, 2)
            pts_pred_offsety = pts_pred[:, :, 0::2]
            pts_pred_offsetx = pts_pred[:, :, 1::2]
            pts_pred = torch.cat([pts_pred_offsetx, pts_pred_offsety],
                                 dim=2).reshape(-1, 2 * self.num_points)

            pts_pos_center = points[:, :2].repeat(1, self.num_points)
            pts = pts_pred * self.point_strides[level_idx] + pts_pos_center

            polys = min_area_polygons(pts)
            bboxes = poly2obb(polys, self.version)

This is because there may be a bug in the CUDA function min_area_polygons when the input values are small. Transferring the predicted point offsets to their real positions in the whole image can sometimes avoid this issue.
(3) Use a better GPU; a V100 may work better than a 2080 Ti in some cases.
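
To make suggestion (1) concrete, here is a minimal sketch of how the test image size might be changed in a typical mmrotate 0.x DOTA test pipeline. The transform names and the img_norm_cfg reference follow the stock DOTA configs; if your pipeline differs, adjust accordingly rather than treating this as the exact fix.

    # Sketch only: based on the usual mmrotate 0.x DOTA test pipeline.
    # img_norm_cfg and the transform list must match your own config.
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(960, 960),  # was (1024, 1024)
            flip=False,
            transforms=[
                dict(type='RResize'),
                dict(type='Normalize', **img_norm_cfg),
                dict(type='Pad', size_divisor=32),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img'])
            ])
    ]
    data = dict(test=dict(pipeline=test_pipeline))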

@shnew (Author) commented Jul 12, 2022

Thank you very much for your advice. Unfortunately, I've tried all of your suggestions, and the results show that sometimes they work and sometimes they don't. In order to test smoothly, my solution was to skip the images that would cause errors. Obviously this is not a perfect solution, so I hope someone can fundamentally solve this problem. Thank you!
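
For reference, a rough sketch of this skip-on-error workaround using the standard mmdet/mmrotate inference API is shown below. The config, checkpoint, and image-list names are placeholders, and there is a caveat: after an illegal memory access the CUDA context is often unusable, so it may be more reliable to record the offending filenames and exclude them from the test annotations before re-running.

    # Rough sketch of the skip-on-error workaround (not a real fix).
    # config_file, checkpoint_file and test_images are placeholders.
    from mmdet.apis import init_detector, inference_detector

    model = init_detector(config_file, checkpoint_file, device='cuda:0')

    results, skipped = [], []
    for img_path in test_images:
        try:
            results.append((img_path, inference_detector(model, img_path)))
        except RuntimeError as err:  # e.g. "CUDA error: an illegal memory access ..."
            print(f'Skipping {img_path}: {err}')
            skipped.append(img_path)
            # The CUDA context may be broken from here on; if so, restart
            # and exclude the recorded images from the test set instead.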

@yangxue0827 (Collaborator)

I guess it is because some sub-images in DOTA 2.0 contain many objects, which causes some CUDA operators to take up a lot of memory.

@shnew (Author) commented Jul 13, 2022

This could be the reason, and it also happened when I used DOTA1.0.

@austinmw commented Jul 22, 2022

Hi, I've faced this too, any workaround? (I ran using V100 GPUs)

yangxue0827 added the "help wanted" label on Aug 13, 2022
@yangxue0827 (Collaborator)

A solution that has worked: set a smaller nms_pre, which limits how many candidate point sets are passed to the rotated NMS and polygon CUDA ops. For example:

test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
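
If it helps, this override could be applied in a small derived config roughly as follows; the _base_ path is illustrative and should point at whichever rotated RepPoints config is actually in use.

    # Illustrative override config; the _base_ path is a placeholder.
    _base_ = ['./rotated_reppoints_r50_fpn_1x_dota_oc.py']

    model = dict(
        test_cfg=dict(
            nms_pre=1000,  # reduced from the stock value (e.g. 2000)
            min_bbox_size=0,
            score_thr=0.05,
            nms=dict(iou_thr=0.4),
            max_per_img=2000))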

@austinmw commented Aug 15, 2022

@yangxue0827 I still get this error even with that change:

    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))

2022-08-15 22:09:10,988 - mmrotate - INFO - Epoch [1][50/2357]#011lr: 3.189e-03, eta: 2:20:54, time: 1.813, data_time: 0.095, memory: 8398, loss_cls: 0.8521, loss_pts_init: 0.3214, loss_pts_refine: 0.2961, loss_spatial_init: 0.0152, loss_spatial_refine: 0.0002, loss: 1.4849, grad_norm: 1.8358
2022-08-15 22:10:44,014 - mmrotate - INFO - Epoch [1][100/2357]#011lr: 3.723e-03, eta: 2:21:14, time: 1.861, data_time: 0.016, memory: 8398, loss_cls: 0.3215, loss_pts_init: 0.3107, loss_pts_refine: 0.2996, loss_spatial_init: 0.0153, loss_spatial_refine: 0.0001, loss: 0.9473, grad_norm: 1.8034
2022-08-15 22:12:20,123 - mmrotate - INFO - Epoch [1][150/2357]#011lr: 4.256e-03, eta: 2:21:52, time: 1.922, data_time: 0.017, memory: 8398, loss_cls: 0.2361, loss_pts_init: 0.3174, loss_pts_refine: 0.2976, loss_spatial_init: 0.0155, loss_spatial_refine: 0.0002, loss: 0.8668, grad_norm: 1.6680
2022-08-15 22:13:48,646 - mmrotate - INFO - Epoch [1][200/2357]#011lr: 4.789e-03, eta: 2:18:32, time: 1.770, data_time: 0.018, memory: 8398, loss_cls: 0.2194, loss_pts_init: 0.3092, loss_pts_refine: 0.2999, loss_spatial_init: 0.0150, loss_spatial_refine: 0.0001, loss: 0.8437, grad_norm: 1.7740
2022-08-15 22:15:09,347 - mmrotate - INFO - Epoch [1][250/2357]#011lr: 5.323e-03, eta: 2:13:37, time: 1.614, data_time: 0.018, memory: 8398, loss_cls: 0.2059, loss_pts_init: 0.3087, loss_pts_refine: 0.2911, loss_spatial_init: 0.0165, loss_spatial_refine: 0.0002, loss: 0.8223, grad_norm: 1.6533
2022-08-15 22:16:37,447 - mmrotate - INFO - Epoch [1][300/2357]#011lr: 5.856e-03, eta: 2:11:42, time: 1.762, data_time: 0.016, memory: 8398, loss_cls: 0.1891, loss_pts_init: 0.3136, loss_pts_refine: 0.2940, loss_spatial_init: 0.0151, loss_spatial_refine: 0.0001, loss: 0.8120, grad_norm: 1.6122
2022-08-15 22:17:57,952 - mmrotate - INFO - Epoch [1][350/2357]#011lr: 6.389e-03, eta: 2:08:20, time: 1.610, data_time: 0.016, memory: 8398, loss_cls: 0.1903, loss_pts_init: 0.3057, loss_pts_refine: 0.2905, loss_spatial_init: 0.0162, loss_spatial_refine: 0.0002, loss: 0.8028, grad_norm: 1.6764
2022-08-15 22:19:23,388 - mmrotate - INFO - Epoch [1][400/2357]#011lr: 6.923e-03, eta: 2:06:22, time: 1.709, data_time: 0.017, memory: 8398, loss_cls: 0.1903, loss_pts_init: 0.3148, loss_pts_refine: 0.2962, loss_spatial_init: 0.0163, loss_spatial_refine: 0.0002, loss: 0.8178, grad_norm: 1.6627
2022-08-15 22:20:54,942 - mmrotate - INFO - Epoch [1][450/2357]#011lr: 7.456e-03, eta: 2:05:29, time: 1.831, data_time: 0.017, memory: 8398, loss_cls: 0.1913, loss_pts_init: 0.3114, loss_pts_refine: 0.2970, loss_spatial_init: 0.0158, loss_spatial_refine: 0.0002, loss: 0.8157, grad_norm: 1.7099
Traceback (most recent call last):
File "/opt/ml/code/mmrotate/tools/train.py", line 192, in
main()
File "/opt/ml/code/mmrotate/tools/train.py", line 181, in main
train_detector(
File "/opt/conda/lib/python3.8/site-packages/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 130, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 59, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/detectors/single_stage.py", line 81, in forward_train
losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
File "/opt/conda/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 335, in forward_train
losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 952, in loss
quality_assess_list, = multi_apply(
File "/opt/conda/lib/python3.8/site-packages/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 480, in pointsets_quality_assessment
sampling_pts_pred_init = self.sampling_points(
File "/opt/conda/lib/python3.8/site-packages/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 342, in sampling_points
ratio = torch.linspace(0, 1, points_num).to(device).repeat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered

I even tried V100s configured with 32 GB of memory instead of 16 GB (AWS p3dn.24xlarge instances).

@chunibyo-wly

Setting a smaller nms_pre also triggers this bug. Any updates on this?

@yangxue0827 (Collaborator)

Switch the optimizer from Adam to SGD, according to #614 (comment); a config sketch is shown below.
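
A minimal sketch of that optimizer change in an mmrotate/mmdet-style config might look like this; the lr, momentum and grad_clip values are common RepPoints-style defaults, not values confirmed in this thread, so tune them for your setup.

    # Sketch: switch the optimizer from Adam to SGD in the config.
    # lr / momentum / grad_clip values are illustrative defaults.
    optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))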

@xiaolinyezi

(Quoting @LiWentomng's suggestions above.)

It worked for me, great!

shnew closed this as completed on Dec 9, 2022
@pphgood commented Mar 20, 2023

(Quoting @austinmw's earlier comment and traceback above.)

I have the same problem as you, and I am also on a V100. Training works fine, but this problem occurs when I test. Have you solved it yet?

@pphgood commented Mar 20, 2023

(Quoting @shnew's earlier comment above about skipping the images that cause errors.)

I met the same problem when evaluating on the test set. I tried changing the image size to (960, 960) and setting nms_pre=1000, but it only works some of the time. Have you solved this problem?

@silencersai

I met the same problem, and adjusting nms_pre did not help. Is there any update on this bug?

@GisRookie

I met the same problem too. Can anyone help?

@soHardToHaveAName

It seems there is no feasible solution yet.
