The training is normal, but the verification always fails #1427
It would be better to provide the cfg file so that we can reproduce the problem you encountered.
I'm very sorry for the late reply. The problem remains unsolved.
2022-04-22 15:43:11,473 - mmseg - INFO - workflow: [('train', 1)], max: 160000 iters
Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.
Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.
Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.
2022-04-22 15:43:36,130 - mmseg - INFO - Iter [50/160000] lr: 1.959e-06, eta: 13:48:46, time: 0.311, data_time: 0.009, memory: 6417, decode.loss_ce: 4.0873, decode.acc_seg: 0.5042, loss: 4.0873
[ ] 16/2000, 0.9 task/s, elapsed: 17s, ETA: 2123sterminate called after throwing an instance of 'c10::Error'
Killing subprocess 44386
From the log you provided, I think an index may be out of bounds. Did you modify any of the mmseg code?
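One way to test the out-of-bounds hypothesis is to scan the annotation masks for label values outside the valid class range. The following is a minimal sketch, not mmseg code: `find_invalid_labels` is a hypothetical helper, and the `ignore_index=255` default is an assumption that should be matched to the value in your own config.

```python
from itertools import chain


def find_invalid_labels(mask, num_classes, ignore_index=255):
    """Return the sorted label values in `mask` that are not valid class ids.

    `mask` is a 2D sequence of integer labels (rows of pixels), for example a
    nested list or a 2D array. `ignore_index` is assumed to be 255 here; use
    whatever value your config actually sets.
    """
    # Collect the distinct label values that occur anywhere in the mask.
    values = {int(v) for v in chain.from_iterable(mask)}
    # A label is valid if it is the ignore index or lies in [0, num_classes).
    return sorted(
        v for v in values
        if v != ignore_index and not 0 <= v < num_classes
    )


if __name__ == "__main__":
    # Toy mask: with 150 classes, valid ids are 0..149, so 150 is out of range.
    toy_mask = [[0, 1, 2], [2, 150, 255]]
    print(find_invalid_labels(toy_mask, num_classes=150))  # prints [150]
```

Running this over the validation annotations and getting a non-empty result for any mask would point to a mismatch between the labels and the number of classes in the config.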
The error log is as follows:
decode.loss_ce: 4.0740, decode.acc_seg: 8.1740, loss: 4.0740
[ ] 6/2000, 0.3 task/s, elapsed: 22s, ETA: 7228sterminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7fddf8800000
Exception raised from free at /workspace/artifacts/paipytorch1.8/dist/ubuntu18.04-py3.6-cuda10.1/build/src/c10/cuda/CUDACachingAllocator.cpp:888 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x60 (0x7fe316ed0800 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x77 (0x7fe316ecda67 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x53a (0x7fe3171137aa in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: THCCachingVmemAllocator_raw_delete(void*) + 0xeb (0x7fe210febc2b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xbf5a0a (0x7fe210e19a0a in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xbeeb0c (0x7fe210e12b0c in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xbef1b7 (0x7fe210e131b7 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xbe7ff2 (0x7fe210e0bff2 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0xc0 (0x7fe210e0c8b0 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xc816dc (0x7fe210ea56dc in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xc81793 (0x7fe210ea5793 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7fe24cd9e338 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2b98434 (0x7fe24e6db434 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2b98973 (0x7fe24e6db973 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7fe24cd9e338 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::_convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0xcfa (0x7fe24c7b40aa in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x159dfa5 (0x7fe24d0e0fa5 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x15dac8b (0x7fe24d11dc8b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0x28d (0x7fe24cd9622d in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: at::native::convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xe5 (0x7fe24c7afe75 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x159ddfb (0x7fe24d0e0dfb in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x15db0f1 (0x7fe24d11e0f1 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x13d430d (0x7fe24cf1730d in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: at::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xc9 (0x7fe24cd951c9 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #24: at::native::conv2d(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0x76 (0x7fe24c7afaf6 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #25: + 0x159e4e6 (0x7fe24d0e14e6 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #26: + 0x15db81e (0x7fe24d11e81e in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x13d4b32 (0x7fe24cf17b32 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #28: at::conv2d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0xae (0x7fe24cd98d6e in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #29: + 0x48c235 (0x7fe2bf43c235 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #30: _PyCFunction_FastCallDict + 0x154 (0x55e9f1287a14 in /home/pai/bin/python)
frame #31: + 0x19aa5c (0x55e9f130fa5c in /home/pai/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #33: + 0x194c1b (0x55e9f1309c1b in /home/pai/bin/python)
frame #34: + 0x19ab35 (0x55e9f130fb35 in /home/pai/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #36: + 0x194c1b (0x55e9f1309c1b in /home/pai/bin/python)
frame #37: + 0x19ab35 (0x55e9f130fb35 in /home/pai/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #39: + 0x194166 (0x55e9f1309166 in /home/pai/bin/python)
frame #40: _PyFunction_FastCallDict + 0x3da (0x55e9f130a54a in /home/pai/bin/python)
frame #41: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #42: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #43: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x196b (0x55e9f13338bb in /home/pai/bin/python)
frame #45: + 0x193fd4 (0x55e9f1308fd4 in /home/pai/bin/python)
frame #46: _PyFunction_FastCallDict + 0x1bc (0x55e9f130a32c in /home/pai/bin/python)
frame #47: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #48: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #49: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)
frame #50: + 0x16c211 (0x55e9f12e1211 in /home/pai/bin/python)
frame #51: _PyObject_FastCallDict + 0x8b (0x55e9f1287bfb in /home/pai/bin/python)
frame #52: + 0x19abae (0x55e9f130fbae in /home/pai/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #54: _PyFunction_FastCallDict + 0x11b (0x55e9f130a28b in /home/pai/bin/python)
frame #55: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #56: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #57: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x196b (0x55e9f13338bb in /home/pai/bin/python)
frame #59: + 0x193fd4 (0x55e9f1308fd4 in /home/pai/bin/python)
frame #60: _PyFunction_FastCallDict + 0x1bc (0x55e9f130a32c in /home/pai/bin/python)
frame #61: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #62: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #63: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)