The training is normal, but the verification always fails #1427

Closed
dongbo811 opened this issue Mar 28, 2022 · 3 comments

@dongbo811

The error log is as follows:

decode.loss_ce: 4.0740, decode.acc_seg: 8.1740, loss: 4.0740
[ ] 6/2000, 0.3 task/s, elapsed: 22s, ETA: 7228s
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7fddf8800000
Exception raised from free at /workspace/artifacts/paipytorch1.8/dist/ubuntu18.04-py3.6-cuda10.1/build/src/c10/cuda/CUDACachingAllocator.cpp:888 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x60 (0x7fe316ed0800 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x77 (0x7fe316ecda67 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x53a (0x7fe3171137aa in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: THCCachingVmemAllocator_raw_delete(void*) + 0xeb (0x7fe210febc2b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xbf5a0a (0x7fe210e19a0a in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xbeeb0c (0x7fe210e12b0c in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xbef1b7 (0x7fe210e131b7 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xbe7ff2 (0x7fe210e0bff2 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0xc0 (0x7fe210e0c8b0 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xc816dc (0x7fe210ea56dc in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xc81793 (0x7fe210ea5793 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7fe24cd9e338 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2b98434 (0x7fe24e6db434 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2b98973 (0x7fe24e6db973 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7fe24cd9e338 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::_convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0xcfa (0x7fe24c7b40aa in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x159dfa5 (0x7fe24d0e0fa5 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x15dac8b (0x7fe24d11dc8b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::_convolution(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0x28d (0x7fe24cd9622d in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: at::native::convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xe5 (0x7fe24c7afe75 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x159ddfb (0x7fe24d0e0dfb in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x15db0f1 (0x7fe24d11e0f1 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x13d430d (0x7fe24cf1730d in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: at::convolution(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xc9 (0x7fe24cd951c9 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #24: at::native::conv2d(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0x76 (0x7fe24c7afaf6 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #25: + 0x159e4e6 (0x7fe24d0e14e6 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #26: + 0x15db81e (0x7fe24d11e81e in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x13d4b32 (0x7fe24cf17b32 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #28: at::conv2d(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0xae (0x7fe24cd98d6e in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #29: + 0x48c235 (0x7fe2bf43c235 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #30: _PyCFunction_FastCallDict + 0x154 (0x55e9f1287a14 in /home/pai/bin/python)
frame #31: + 0x19aa5c (0x55e9f130fa5c in /home/pai/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #33: + 0x194c1b (0x55e9f1309c1b in /home/pai/bin/python)
frame #34: + 0x19ab35 (0x55e9f130fb35 in /home/pai/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #36: + 0x194c1b (0x55e9f1309c1b in /home/pai/bin/python)
frame #37: + 0x19ab35 (0x55e9f130fb35 in /home/pai/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #39: + 0x194166 (0x55e9f1309166 in /home/pai/bin/python)
frame #40: _PyFunction_FastCallDict + 0x3da (0x55e9f130a54a in /home/pai/bin/python)
frame #41: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #42: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #43: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x196b (0x55e9f13338bb in /home/pai/bin/python)
frame #45: + 0x193fd4 (0x55e9f1308fd4 in /home/pai/bin/python)
frame #46: _PyFunction_FastCallDict + 0x1bc (0x55e9f130a32c in /home/pai/bin/python)
frame #47: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #48: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #49: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)
frame #50: + 0x16c211 (0x55e9f12e1211 in /home/pai/bin/python)
frame #51: _PyObject_FastCallDict + 0x8b (0x55e9f1287bfb in /home/pai/bin/python)
frame #52: + 0x19abae (0x55e9f130fbae in /home/pai/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x30a (0x55e9f133225a in /home/pai/bin/python)
frame #54: _PyFunction_FastCallDict + 0x11b (0x55e9f130a28b in /home/pai/bin/python)
frame #55: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #56: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #57: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x196b (0x55e9f13338bb in /home/pai/bin/python)
frame #59: + 0x193fd4 (0x55e9f1308fd4 in /home/pai/bin/python)
frame #60: _PyFunction_FastCallDict + 0x1bc (0x55e9f130a32c in /home/pai/bin/python)
frame #61: _PyObject_FastCallDict + 0x26f (0x55e9f1287ddf in /home/pai/bin/python)
frame #62: _PyObject_Call_Prepend + 0x63 (0x55e9f128c873 in /home/pai/bin/python)
frame #63: PyObject_Call + 0x3e (0x55e9f128781e in /home/pai/bin/python)

@MeowZheng
Collaborator

Could you provide the config file so that we can reproduce the problem?

@dongbo811
Author

dongbo811 commented Apr 22, 2022

Sorry for the late reply. The problem remains unsolved.

`2022-04-22 15:43:11,473 - mmseg - INFO - workflow: [('train', 1)], max: 160000 iters
2022-04-22 15:43:11,473 - mmseg - INFO - Checkpoints will be saved to /home/admin/workspace/savexxxxxxx/_512x512_160k_ade20k by HardDiskBackend.
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [81, 1, 7, 7], strides() = [49, 1, 7, 1]
bucket_view.sizes() = [81, 1, 7, 7], strides() = [49, 49, 7, 1] (function operator())
[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [81, 1, 7, 7], strides() = [49, 1, 7, 1]
bucket_view.sizes() = [81, 1, 7, 7], strides() = [49, 49, 7, 1] (function operator())
[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [81, 1, 7, 7], strides() = [49, 1, 7, 1]
bucket_view.sizes() = [81, 1, 7, 7], strides() = [49, 49, 7, 1] (function operator())
[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [81, 1, 7, 7], strides() = [49, 1, 7, 1]
bucket_view.sizes() = [81, 1, 7, 7], strides() = [49, 49, 7, 1] (function operator())
Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.

Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.

Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.

Cannot get the env variable of GPU_STATUS_FILE, no data report to scheduler. This is not an error. It is because the scheduler of the cluster did not enable this feature.

2022-04-22 15:43:36,130 - mmseg - INFO - Iter [50/160000] lr: 1.959e-06, eta: 13:48:46, time: 0.311, data_time: 0.009, memory: 6417, decode.loss_ce: 4.0873, decode.acc_seg: 0.5042, loss: 4.0873
2022-04-22 15:43:48,631 - mmseg - INFO - Iter [100/160000] lr: 3.958e-06, eta: 12:27:09, time: 0.250, data_time: 0.006, memory: 6417, decode.loss_ce: 4.0507, decode.acc_seg: 0.3681, loss: 4.0507
[ ] 12/2000, 0.8 task/s, elapsed: 15s, ETA: 2417s
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7f3c73800000
Exception raised from free at /workspace/artifacts/paipytorch1.8/dist/ubuntu18.04-py3.6-cuda10.1/build/src/c10/cuda/CUDACachingAllocator.cpp:888 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x60 (0x7f4013811800 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x77 (0x7f401380ea67 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x53a (0x7f4013a547aa in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: THCCachingVmemAllocator_raw_delete(void*) + 0xeb (0x7f4016f5ba8b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x312df4a (0x7f4016d97f4a in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x312748c (0x7f4016d9148c in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x3127b37 (0x7f4016d91b37 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x3120ed2 (0x7f4016d8aed2 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0xc0 (0x7f4016d8b790 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0x31afd3c (0x7f4016e19d3c in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0x31afdf3 (0x7f4016e19df3 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7f40452fa6b8 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2ba60a4 (0x7f4046c380a4 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2ba65e3 (0x7f4046c385e3 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7f40452fa6b8 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::_convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0xcfa (0x7f4044d1001a in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x15ab325 (0x7f404563d325 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x15e800b (0x7f404567a00b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::_convolution(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0x28d (0x7f40452f25ad in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: at::native::convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xe5 (0x7f4044d0bde5 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x15ab17b (0x7f404563d17b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x15e8471 (0x7f404567a471 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x13e168d (0x7f404547368d in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: at::convolution(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xc9 (0x7f40452f1549 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #24: at::native::conv2d(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0x76 (0x7f4044d0ba66 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #25: + 0x15ab866 (0x7f404563d866 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #26: + 0x15e8b9e (0x7f404567ab9e in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x13e1eb2 (0x7f4045473eb2 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #28: at::conv2d(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0xae (0x7f40452f50ee in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #29: + 0x4928b5 (0x7f404d2db8b5 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #30: _PyCFunction_FastCallDict + 0x154 (0x55d31edfaa14 in /home/pai/bin/python)
frame #31: + 0x19aa5c (0x55d31ee82a5c in /home/pai/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x30a (0x55d31eea525a in /home/pai/bin/python)
frame #33: + 0x194c1b (0x55d31ee7cc1b in /home/pai/bin/python)
frame #34: + 0x19ab35 (0x55d31ee82b35 in /home/pai/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x30a (0x55d31eea525a in /home/pai/bin/python)
frame #36: _PyFunction_FastCallDict + 0x11b (0x55d31ee7d28b in /home/pai/bin/python)
frame #37: _PyObject_FastCallDict + 0x26f (0x55d31edfaddf in /home/pai/bin/python)
frame #38: _PyObject_Call_Prepend + 0x63 (0x55d31edff873 in /home/pai/bin/python)
frame #39: PyObject_Call + 0x3e (0x55d31edfa81e in /home/pai/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x196b (0x55d31eea68bb in /home/pai/bin/python)
frame #41: + 0x193fd4 (0x55d31ee7bfd4 in /home/pai/bin/python)
frame #42: _PyFunction_FastCallDict + 0x1bc (0x55d31ee7d32c in /home/pai/bin/python)
frame #43: _PyObject_FastCallDict + 0x26f (0x55d31edfaddf in /home/pai/bin/python)
frame #44: _PyObject_Call_Prepend + 0x63 (0x55d31edff873 in /home/pai/bin/python)
frame #45: PyObject_Call + 0x3e (0x55d31edfa81e in /home/pai/bin/python)
frame #46: + 0x16c211 (0x55d31ee54211 in /home/pai/bin/python)
frame #47: _PyObject_FastCallDict + 0x8b (0x55d31edfabfb in /home/pai/bin/python)
frame #48: + 0x19abae (0x55d31ee82bae in /home/pai/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x30a (0x55d31eea525a in /home/pai/bin/python)
frame #50: + 0x193fd4 (0x55d31ee7bfd4 in /home/pai/bin/python)
frame #51: + 0x194e51 (0x55d31ee7ce51 in /home/pai/bin/python)
frame #52: + 0x19ab35 (0x55d31ee82b35 in /home/pai/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x30a (0x55d31eea525a in /home/pai/bin/python)
frame #54: + 0x193fd4 (0x55d31ee7bfd4 in /home/pai/bin/python)
frame #55: _PyFunction_FastCallDict + 0x3da (0x55d31ee7d54a in /home/pai/bin/python)
frame #56: _PyObject_FastCallDict + 0x26f (0x55d31edfaddf in /home/pai/bin/python)
frame #57: _PyObject_Call_Prepend + 0x63 (0x55d31edff873 in /home/pai/bin/python)
frame #58: PyObject_Call + 0x3e (0x55d31edfa81e in /home/pai/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x196b (0x55d31eea68bb in /home/pai/bin/python)
frame #60: + 0x193fd4 (0x55d31ee7bfd4 in /home/pai/bin/python)
frame #61: _PyFunction_FastCallDict + 0x3da (0x55d31ee7d54a in /home/pai/bin/python)
frame #62: _PyObject_FastCallDict + 0x26f (0x55d31edfaddf in /home/pai/bin/python)
frame #63: _PyObject_Call_Prepend + 0x63 (0x55d31edff873 in /home/pai/bin/python)

[ ] 16/2000, 0.9 task/s, elapsed: 17s, ETA: 2123s
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7f15bd800000
Exception raised from free at /workspace/artifacts/paipytorch1.8/dist/ubuntu18.04-py3.6-cuda10.1/build/src/c10/cuda/CUDACachingAllocator.cpp:888 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x60 (0x7f196384e800 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x77 (0x7f196384ba67 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x53a (0x7f1963a917aa in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: THCCachingVmemAllocator_raw_delete(void*) + 0xeb (0x7f1966f98a8b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x312df4a (0x7f1966dd4f4a in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x312748c (0x7f1966dce48c in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x3127b37 (0x7f1966dceb37 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x3120ed2 (0x7f1966dc7ed2 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0xc0 (0x7f1966dc8790 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0x31afd3c (0x7f1966e56d3c in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0x31afdf3 (0x7f1966e56df3 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7f19953376b8 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2ba60a4 (0x7f1996c750a4 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0x2ba65e3 (0x7f1996c755e3 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool) + 0x208 (0x7f19953376b8 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::_convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0xcfa (0x7f1994d4d01a in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x15ab325 (0x7f199567a325 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x15e800b (0x7f19956b700b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::_convolution(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool) + 0x28d (0x7f199532f5ad in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: at::native::convolution(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xe5 (0x7f1994d48de5 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x15ab17b (0x7f199567a17b in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x15e8471 (0x7f19956b7471 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x13e168d (0x7f19954b068d in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: at::convolution(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long) + 0xc9 (0x7f199532e549 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #24: at::native::conv2d(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0x76 (0x7f1994d48a66 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #25: + 0x15ab866 (0x7f199567a866 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #26: + 0x15e8b9e (0x7f19956b7b9e in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #27: + 0x13e1eb2 (0x7f19954b0eb2 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #28: at::conv2d(at::Tensor const&, at::Tensor const&, c10::optionalat::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long) + 0xae (0x7f19953320ee in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #29: + 0x4928b5 (0x7f199d3188b5 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #30: _PyCFunction_FastCallDict + 0x154 (0x5644f09cea14 in /home/pai/bin/python)
frame #31: + 0x19aa5c (0x5644f0a56a5c in /home/pai/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x30a (0x5644f0a7925a in /home/pai/bin/python)
frame #33: + 0x194c1b (0x5644f0a50c1b in /home/pai/bin/python)
frame #34: + 0x19ab35 (0x5644f0a56b35 in /home/pai/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x30a (0x5644f0a7925a in /home/pai/bin/python)
frame #36: _PyFunction_FastCallDict + 0x11b (0x5644f0a5128b in /home/pai/bin/python)
frame #37: _PyObject_FastCallDict + 0x26f (0x5644f09ceddf in /home/pai/bin/python)
frame #38: _PyObject_Call_Prepend + 0x63 (0x5644f09d3873 in /home/pai/bin/python)
frame #39: PyObject_Call + 0x3e (0x5644f09ce81e in /home/pai/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x196b (0x5644f0a7a8bb in /home/pai/bin/python)
frame #41: + 0x193fd4 (0x5644f0a4ffd4 in /home/pai/bin/python)
frame #42: _PyFunction_FastCallDict + 0x1bc (0x5644f0a5132c in /home/pai/bin/python)
frame #43: _PyObject_FastCallDict + 0x26f (0x5644f09ceddf in /home/pai/bin/python)
frame #44: _PyObject_Call_Prepend + 0x63 (0x5644f09d3873 in /home/pai/bin/python)
frame #45: PyObject_Call + 0x3e (0x5644f09ce81e in /home/pai/bin/python)
frame #46: + 0x16c211 (0x5644f0a28211 in /home/pai/bin/python)
frame #47: _PyObject_FastCallDict + 0x8b (0x5644f09cebfb in /home/pai/bin/python)
frame #48: + 0x19abae (0x5644f0a56bae in /home/pai/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x30a (0x5644f0a7925a in /home/pai/bin/python)
frame #50: + 0x193fd4 (0x5644f0a4ffd4 in /home/pai/bin/python)
frame #51: + 0x194e51 (0x5644f0a50e51 in /home/pai/bin/python)
frame #52: + 0x19ab35 (0x5644f0a56b35 in /home/pai/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x30a (0x5644f0a7925a in /home/pai/bin/python)
frame #54: + 0x193fd4 (0x5644f0a4ffd4 in /home/pai/bin/python)
frame #55: _PyFunction_FastCallDict + 0x3da (0x5644f0a5154a in /home/pai/bin/python)
frame #56: _PyObject_FastCallDict + 0x26f (0x5644f09ceddf in /home/pai/bin/python)
frame #57: _PyObject_Call_Prepend + 0x63 (0x5644f09d3873 in /home/pai/bin/python)
frame #58: PyObject_Call + 0x3e (0x5644f09ce81e in /home/pai/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x196b (0x5644f0a7a8bb in /home/pai/bin/python)
frame #60: + 0x193fd4 (0x5644f0a4ffd4 in /home/pai/bin/python)
frame #61: _PyFunction_FastCallDict + 0x3da (0x5644f0a5154a in /home/pai/bin/python)
frame #62: _PyObject_FastCallDict + 0x26f (0x5644f09ceddf in /home/pai/bin/python)
frame #63: _PyObject_Call_Prepend + 0x63 (0x5644f09d3873 in /home/pai/bin/python)

Killing subprocess 44386
Killing subprocess 44387
Killing subprocess 44388
Killing subprocess 44389
Traceback (most recent call last):
File "/home/pai/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/pai/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/pai/bin/python', '-u', './train.py', '--local_rank=3', 'configs/ham/hamnet_light_van_tiny_d256_512x512_160k_ade20k.py', '--seed', '0', '--launcher', 'pytorch']' died with <Signals.SIGABRT: 6>.`
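
A note on the `Grad strides do not match bucket view strides` warning in the log above: the two stride lists differ only in the dimension of size 1, where the index is always 0 and the stride therefore never contributes to an element's address. The following is a minimal pure-Python sketch of that arithmetic; the helper names `contiguous_strides` and `element_offsets` are made up for illustration and are not part of PyTorch.

```python
from itertools import product


def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return list(reversed(strides))


def element_offsets(shape, strides):
    """Memory offset of every element, visited in row-major index order."""
    return [
        sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in shape))
    ]


shape = [81, 1, 7, 7]            # grad.sizes() from the warning
grad_strides = [49, 1, 7, 1]     # grad strides from the warning
bucket_strides = contiguous_strides(shape)

# Both stride lists map every index to the same offset, because the only
# differing entry belongs to a dimension whose index is always 0.
same_layout = element_offsets(shape, grad_strides) == element_offsets(shape, bucket_strides)
print(bucket_strides, same_layout)
```

This is consistent with the warning's own statement that the mismatch is not an error, so it is unlikely to be related to the `invalid device pointer` crash.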

@MeowZheng
Collaborator

MeowZheng commented May 3, 2022

From the log you provided

[ ] 12/2000, 0.8 task/s, elapsed: 15s, ETA: 2417s
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7f3c73800000
Exception raised from free at /workspace/artifacts/paipytorch1.8/dist/ubuntu18.04-py3.6-cuda10.1/build/src/c10/cuda/CUDACachingAllocator.cpp:888 (most recent call first):

I think there might be an out-of-bounds index somewhere. Did you modify any of the mmseg code?
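
One way to test the out-of-bounds hypothesis is to scan the validation annotations for label values that fall outside the class range. The sketch below is only an illustration, not mmseg code: the function name `out_of_range_labels` is hypothetical, and `num_classes=150` and `ignore_index=255` are assumptions based on the ADE20K config name in the log, so adjust them to your own config. It accepts any iterable of integer labels; for a NumPy mask you could pass `mask.ravel().tolist()`.

```python
from collections import Counter


def out_of_range_labels(labels, num_classes, ignore_index=255):
    """Count label values outside [0, num_classes) that are not ignore_index.

    Hypothetical helper for illustration; not part of mmseg.
    """
    counts = Counter(labels)
    return {
        value: n
        for value, n in sorted(counts.items())
        if value != ignore_index and not 0 <= value < num_classes
    }


# Toy mask: 150 classes assumed, 255 assumed to mark ignored pixels.
toy_mask = [0, 3, 149, 150, 255, 255, 200, -1]
bad = out_of_range_labels(toy_mask, num_classes=150)
print(bad)  # {-1: 1, 150: 1, 200: 1}
```

An empty result for every annotation file would suggest that the labels are fine and that the cause lies elsewhere, for example in modified model code or in the custom PyTorch build shown in the stack trace.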
