Segmentfault in multiprocessing DataLoader when training on Kunpeng cpu #2506

Closed
MengqingCao opened this issue Apr 26, 2024 · 1 comment · Fixed by #2507

Describe the bug
A segmentation fault occurs while train.py is running. It happens when the DataLoader creates its worker processes.

The log:

/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/_jit_internal.py:730: FutureWarning: ignore(True) has been deprecated. TorchScript will now drop the function call on compilation. Use torch.jit.unused now. {}
  warnings.warn(
2024-04-26 10:48:01,380 INFO use char tokenizer
2024-04-26 10:48:01,381 INFO training on multiple gpus, this gpu 0, rank 0, world_size 1
{'encoder': 'conformer', 'encoder_conf': {'output_size': 256, 'attention_heads': 4, 'linear_units': 2048, 'num_blocks': 12, 'dropout_rate': 0.1, 'positional_dropout_rate': 0.1, 'attention_dropout_rate': 0.0, 'input_layer': 'conv2d', 'normalize_before': True, 'cnn_module_kernel': 15, 'use_cnn_module': True, 'activation_type': 'swish', 'pos_enc_layer_type': 'rel_pos', 'selfattention_layer_type': 'rel_selfattn'}, 'decoder': 'transformer', 'decoder_conf': {'attention_heads': 4, 'linear_units': 2048, 'num_blocks': 6, 'dropout_rate': 0.1, 'positional_dropout_rate': 0.1, 'self_attention_dropout_rate': 0.0, 'src_attention_dropout_rate': 0.0}, 'tokenizer': 'char', 'tokenizer_conf': {'symbol_table_path': 'data/dict/lang_char.txt', 'split_with_space': False, 'bpe_path': None, 'non_lang_syms_path': None, 'is_multilingual': False, 'num_languages': 1, 'special_tokens': {'<blank>': 0, '<unk>': 1, '<sos>': 2, '<eos>': 2}}, 'ctc': 'ctc', 'ctc_conf': {'ctc_blank_id': 0}, 'cmvn': 'global_cmvn', 'cmvn_conf': {'cmvn_file': 'data/train/global_cmvn', 'is_json_cmvn': True}, 'model': 'asr_model', 'model_conf': {'ctc_weight': 0.3, 'lsm_weight': 0.1, 'length_normalized_loss': False}, 'dataset': 'asr', 'dataset_conf': {'filter_conf': {'max_length': 40960, 'min_length': 0, 'token_max_length': 200, 'token_min_length': 1}, 'resample_conf': {'resample_rate': 16000}, 'speed_perturb': True, 'fbank_conf': {'num_mel_bins': 80, 'frame_shift': 10, 'frame_length': 25, 'dither': 0.1}, 'spec_aug': True, 'spec_aug_conf': {'num_t_mask': 2, 'num_f_mask': 2, 'max_t': 50, 'max_f': 10}, 'shuffle': True, 'shuffle_conf': {'shuffle_size': 1500}, 'sort': True, 'sort_conf': {'sort_size': 500}, 'batch_conf': {'batch_type': 'static', 'batch_size': 16}}, 'grad_clip': 5, 'accum_grad': 4, 'max_epoch': 1, 'log_interval': 100, 'optim': 'adam', 'optim_conf': {'lr': 0.002}, 'scheduler': 'warmuplr', 'scheduler_conf': {'warmup_steps': 25000}, 'vocab_size': 4233, 'dtype': 'fp32', 'input_dim': 80, 'output_dim': 4233, 
'train_engine': 'torch_ddp', 'use_amp': False, 'model_dir': '/home/cmq/code/wenet/examples/aishell/s0/exp/conformer', 'save_states': 'model_only', 'init_infos': {}}
2024-04-26 10:48:02,603 INFO Checkpoint: save to checkpoint /home/cmq/code/wenet/examples/aishell/s0/exp/conformer/init.pt
2024-04-26 10:48:03,299 INFO Epoch 0 TRAIN info lr 8e-08 rank 0
2024-04-26 10:48:03,304 INFO using accumulate grad, new batch size is 4 times larger than before
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1054269) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cmq/code/wenet/examples/aishell/s0/wenet/bin/train.py", line 183, in <module>
    main()
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/cmq/code/wenet/examples/aishell/s0/wenet/bin/train.py", line 149, in main
    executor.train(model, optimizer, scheduler, train_data_loader,
  File "/home/cmq/code/wenet/examples/aishell/s0/wenet/utils/executor.py", line 57, in train
    for batch_idx, batch_dict in enumerate(train_data_loader):
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1294, in _get_data
    success, data = self._try_get_data()
  File "/home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 1054269) exited unexpectedly

The gdb backtrace from the core file of the crashed worker (pid 1054269):

GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)

warning: core file may not match specified executable file.
[New LWP 1054269]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Core was generated by `python wenet/bin/train.py --device cpu --train_engine torch_ddp --config /home/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffffad7cb424 in handler_SIGSEGV(int, siginfo_t*, void*) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#2  <signal handler called>
#3  0x0000ffffaed04ca0 in ?? () from /usr/lib/aarch64-linux-gnu/libgomp.so.1
#4  0x0000ffffaecfd6ec in GOMP_parallel () from /usr/lib/aarch64-linux-gnu/libgomp.so.1
#5  0x0000ffffa316f988 in arm_compute::OMPScheduler::run_workloads(std::vector<std::function<void (arm_compute::ThreadInfo const&)>, std::allocator<std::function<void (arm_compute::ThreadInfo const&)> > >&) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/../../torch.libs/libarm_compute-556725d0.so
#6  0x0000ffffa3170174 in arm_compute::OMPScheduler::schedule_op(arm_compute::ICPPKernel*, arm_compute::IScheduler::Hints const&, arm_compute::Window const&, arm_compute::ITensorPack&) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/../../torch.libs/libarm_compute-556725d0.so
#7  0x0000ffffa3173d58 in arm_compute::experimental::INEOperator::run(arm_compute::ITensorPack&, arm_compute::Window const&) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/../../torch.libs/libarm_compute-556725d0.so
#8  0x0000ffffa32e9648 in arm_compute::NETranspose::run() ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/../../torch.libs/libarm_compute-556725d0.so
#9  0x0000ffffaa4e4120 in dnnl::impl::cpu::aarch64::matmul::acl_matmul_t::execute_forward(dnnl::impl::exec_ctx_t const&) const ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#10 0x0000ffffa9ca00f8 in dnnl_primitive::execute(dnnl::impl::exec_ctx_t&) const ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#11 0x0000ffffa9ca077c in dnnl::impl::primitive_execute(dnnl_primitive const*, dnnl::impl::exec_ctx_t&) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#12 0x0000ffffa9ca0a40 in dnnl_primitive_execute () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#13 0x0000ffffa64e40a8 in dnnl::primitive::execute(dnnl::stream const&, std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > > const&) const () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#14 0x0000ffffa6b8dd78 in void ideep::matmul_forward::do_compute<false, true, true>(ideep::matmul_forward_params const&, ideep::tensor const&, ideep::tensor const&, ideep::tensor const&, ideep::tensor&, std::vector<ideep::tensor, std::allocator<ideep::tensor> > const&) [clone .isra.0] ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#15 0x0000ffffa6b8f630 in void ideep::matmul_forward::compute_impl<false, true, true>(ideep::tensor const&, ideep::tensor const&, ideep::tensor const&, ideep::tensor&, std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, float, float, ideep::attr_t const&, std::vector<ideep::tensor, std::allocator<ideep::tensor> > const&, dnnl::memory::data_type, ideep::lowp_kind, ideep::engine const&) [clone .isra.0] ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#16 0x0000ffffa6b90fc8 in at::native::mkldnn_matmul(at::Tensor const&, at::Tensor const&, at::Tensor const&, float, float) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#17 0x0000ffffa6659fc0 in at::native::addmm_impl_cpu_(at::Tensor&, at::Tensor const&, at::Tensor, at::Tensor, c10::Scalar const&, c10::Scalar const&) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#18 0x0000ffffa665a458 in at::native::structured_mm_out_cpu::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#19 0x0000ffffa72b0cfc in at::(anonymous namespace)::wrapper_CPU_mm(at::Tensor const&, at::Tensor const&) ()
   from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#20 0x0000ffffa72b0d6c in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CPU_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#21 0x0000ffffa7051e64 in at::_ops::mm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#22 0x0000ffffa8797dd4 in torch::autograd::VariableType::(anonymous namespace)::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#23 0x0000ffffa87986c4 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#24 0x0000ffffa70a9eb8 in at::_ops::mm::call(at::Tensor const&, at::Tensor const&) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#25 0x0000ffffad67eee8 in torch::autograd::THPVariable_mm(_object*, _object*, _object*) () from /home/cmq/miniconda3/envs/wenet/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#26 0x00000000005d5cfc in cfunction_call ()
#27 0x000000000043dee0 in _PyObject_MakeTpCall ()
#28 0x000000000042f1a8 in _PyEval_EvalFrameDefault ()
#29 0x00000000004d5874 in _PyEval_Vector ()
#30 0x000000000042bec0 in _PyEval_EvalFrameDefault ()
#31 0x00000000004d5874 in _PyEval_Vector ()
#32 0x000000000043daac in PyVectorcall_Call ()
#33 0x00000000005685c0 in partial_call ()
#34 0x000000000043dee0 in _PyObject_MakeTpCall ()
#35 0x000000000042f158 in _PyEval_EvalFrameDefault ()
#36 0x00000000004d5874 in _PyEval_Vector ()
#37 0x00000000005bd6e0 in method_vectorcall ()
#38 0x000000000042f158 in _PyEval_EvalFrameDefault ()
#39 0x00000000005ca0cc in gen_send_ex2 ()
#40 0x00000000005cb148 in gen_send ()
#41 0x00000000005c5370 in method_vectorcall_O ()
#42 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#43 0x00000000005ca0cc in gen_send_ex2 ()
#44 0x00000000005cacf4 in gen_iternext ()
#45 0x0000000000426db0 in _PyEval_EvalFrameDefault ()
#46 0x00000000005ca0cc in gen_send_ex2 ()
#47 0x00000000005cb148 in gen_send ()
#48 0x00000000005c5370 in method_vectorcall_O ()
#49 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#50 0x00000000005ca0cc in gen_send_ex2 ()
#51 0x00000000005cacf4 in gen_iternext ()
#52 0x0000000000426db0 in _PyEval_EvalFrameDefault ()
#53 0x00000000005ca0cc in gen_send_ex2 ()
#54 0x00000000005cb148 in gen_send ()
#55 0x00000000005c5370 in method_vectorcall_O ()
#56 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#57 0x00000000005ca0cc in gen_send_ex2 ()
#58 0x00000000005cacf4 in gen_iternext ()
#59 0x0000000000426db0 in _PyEval_EvalFrameDefault ()
#60 0x00000000005ca0cc in gen_send_ex2 ()
#61 0x00000000005cb148 in gen_send ()
#62 0x00000000005c5370 in method_vectorcall_O ()
#63 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#64 0x00000000005ca0cc in gen_send_ex2 ()
#65 0x00000000005cacf4 in gen_iternext ()
#66 0x0000000000426db0 in _PyEval_EvalFrameDefault ()
#67 0x00000000005ca0cc in gen_send_ex2 ()
#68 0x00000000005cb148 in gen_send ()
#69 0x00000000005c5370 in method_vectorcall_O ()
#70 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#71 0x00000000005ca0cc in gen_send_ex2 ()
#72 0x00000000005cacf4 in gen_iternext ()
#73 0x0000000000426db0 in _PyEval_EvalFrameDefault ()
#74 0x00000000005ca0cc in gen_send_ex2 ()
#75 0x00000000005cb148 in gen_send ()
#76 0x00000000005c5370 in method_vectorcall_O ()
#77 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#78 0x00000000005ca0cc in gen_send_ex2 ()
#79 0x00000000005cacf4 in gen_iternext ()
#80 0x0000000000426db0 in _PyEval_EvalFrameDefault ()
#81 0x00000000005ca0cc in gen_send_ex2 ()
#82 0x00000000005cb148 in gen_send ()
#83 0x00000000005c5370 in method_vectorcall_O ()
#84 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#85 0x00000000005ca0cc in gen_send_ex2 ()
#86 0x00000000005cacf4 in gen_iternext ()
#87 0x00000000005ff574 in builtin_next ()
#88 0x00000000005d66a8 in cfunction_vectorcall_FASTCALL ()
#89 0x000000000042bdd4 in _PyEval_EvalFrameDefault ()
#90 0x00000000004d5874 in _PyEval_Vector ()
#91 0x0000000000428164 in _PyEval_EvalFrameDefault ()
#92 0x00000000004d5874 in _PyEval_Vector ()
#93 0x000000000048d800 in vectorcall_method ()
#94 0x000000000048e750 in slot_tp_iternext ()
#95 0x00000000005ff574 in builtin_next ()
#96 0x00000000005d66a8 in cfunction_vectorcall_FASTCALL ()
#97 0x000000000042bdd4 in _PyEval_EvalFrameDefault ()
#98 0x00000000004d5874 in _PyEval_Vector ()
#99 0x00000000005bd6e0 in method_vectorcall ()
#100 0x000000000042f158 in _PyEval_EvalFrameDefault ()
#101 0x00000000004d5874 in _PyEval_Vector ()
#102 0x000000000048d800 in vectorcall_method ()
#103 0x000000000048e750 in slot_tp_iternext ()
#104 0x00000000005ff574 in builtin_next ()
#105 0x00000000005d66a8 in cfunction_vectorcall_FASTCALL ()
#106 0x000000000042bdd4 in _PyEval_EvalFrameDefault ()
#107 0x00000000004d5874 in _PyEval_Vector ()
#108 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#109 0x00000000004d5874 in _PyEval_Vector ()
#110 0x0000000000428164 in _PyEval_EvalFrameDefault ()
#111 0x00000000004d5874 in _PyEval_Vector ()
#112 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#113 0x00000000004d5874 in _PyEval_Vector ()
#114 0x00000000005bd6e0 in method_vectorcall ()
#115 0x000000000042bec0 in _PyEval_EvalFrameDefault ()
#116 0x00000000004d5874 in _PyEval_Vector ()
#117 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#118 0x00000000004d5874 in _PyEval_Vector ()
#119 0x000000000043e0bc in _PyObject_FastCallDictTstate ()
#120 0x000000000043e410 in _PyObject_Call_Prepend ()
#121 0x000000000048eeac in slot_tp_init ()
#122 0x00000000004871f0 in type_call ()
#123 0x000000000043dee0 in _PyObject_MakeTpCall ()
#124 0x000000000042e370 in _PyEval_EvalFrameDefault ()
#125 0x00000000004d5874 in _PyEval_Vector ()
#126 0x000000000042f158 in _PyEval_EvalFrameDefault ()
#127 0x00000000004d5874 in _PyEval_Vector ()
#128 0x000000000042f158 in _PyEval_EvalFrameDefault ()
#129 0x00000000004d5874 in _PyEval_Vector ()
#130 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#131 0x00000000004d5874 in _PyEval_Vector ()
#132 0x000000000043e0bc in _PyObject_FastCallDictTstate ()
#133 0x000000000043e410 in _PyObject_Call_Prepend ()
#134 0x000000000048eeac in slot_tp_init ()
#135 0x00000000004871f0 in type_call ()
#136 0x000000000043dee0 in _PyObject_MakeTpCall ()
#137 0x000000000042e370 in _PyEval_EvalFrameDefault ()
#138 0x00000000004d5874 in _PyEval_Vector ()
#139 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#140 0x00000000004d5874 in _PyEval_Vector ()
#141 0x0000000000483218 in call_unbound_noarg ()
#142 0x000000000048ae98 in slot_tp_iter ()
#143 0x00000000005af2ac in PyObject_GetIter ()
#144 0x00000000005c734c in enum_new ()
#145 0x0000000000487198 in type_call ()
#146 0x000000000043dee0 in _PyObject_MakeTpCall ()
#147 0x000000000042e370 in _PyEval_EvalFrameDefault ()
#148 0x00000000004d5874 in _PyEval_Vector ()
#149 0x000000000042c168 in _PyEval_EvalFrameDefault ()
#150 0x00000000004d5874 in _PyEval_Vector ()
#151 0x0000000000428164 in _PyEval_EvalFrameDefault ()
#152 0x00000000004d5874 in _PyEval_Vector ()
#153 0x000000000042bdd4 in _PyEval_EvalFrameDefault ()
#154 0x00000000004d5874 in _PyEval_Vector ()
#155 0x00000000004d5a70 in PyEval_EvalCode ()
#156 0x000000000051454c in run_eval_code_obj ()
#157 0x00000000005147e8 in run_mod ()
#158 0x0000000000514918 in pyrun_file ()
#159 0x0000000000516f34 in _PyRun_SimpleFileObject ()
#160 0x0000000000517510 in _PyRun_AnyFileObject ()
#161 0x000000000043128c in Py_RunMain ()
#162 0x0000000000431c24 in Py_BytesMain ()
#163 0x0000ffffaea94e10 in __libc_start_main (main=0x4258e0 <main>, argc=23, argv=0xffffce610d58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:308
#164 0x00000000004304f0 in _start ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

To Reproduce
Steps to reproduce the behavior:

  1. Build and activate the Python environment on a machine with a Kunpeng CPU.
  2. cd ./examples/aishell/s0
  3. Run bash run.sh. When it reaches stage 4 (running train.py), the segmentation fault occurs.
  4. See error

Expected behavior
Training runs without a segmentation fault.

Desktop (please complete the following information):

  • OS: [Ubuntu 20.04.6 LTS]
  • Browser [N/A]
  • Version [20.04.6 LTS]
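
For anyone comparing against their own setup, a small check like the one below reports the CPU architecture and the multiprocessing start method in use. It is only a sketch using the standard library; `describe_environment` is a hypothetical helper name, not something from wenet or PyTorch.

```python
import multiprocessing as mp
import platform


def describe_environment():
    """Collect the details that matter for this report (illustrative helper)."""
    return {
        # "aarch64" is expected on a Kunpeng machine
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Start method that multiprocessing uses when none is specified
        "start_method": mp.get_start_method(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

On Linux with Python 3.10, as in the log above, the default start method is fork unless it has been changed explicitly.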

Additional context
I have confirmed that this error is caused by the way the worker processes are created. Setting the multiprocessing context to spawn, i.e. passing multiprocessing_context=mp.get_context("spawn") to DataLoader, solves the problem. As far as I know, the spawn start method is available on most systems (Windows, macOS, and all POSIX platforms).
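
A minimal sketch of the proposed workaround is shown below. It is an illustration only, not the actual change in the PR: `loader_kwargs` is a hypothetical helper name, and `dataset` in the usage comment is a placeholder. The guard on `num_workers` is there because DataLoader accepts `multiprocessing_context` only with multi-process loading (`num_workers > 0`).

```python
import multiprocessing as mp


def loader_kwargs(num_workers, start_method="spawn"):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; not part of wenet.
    """
    kwargs = {"num_workers": num_workers}
    # DataLoader rejects multiprocessing_context when num_workers == 0,
    # so only set it when worker processes are actually used.
    if num_workers > 0:
        kwargs["multiprocessing_context"] = mp.get_context(start_method)
    return kwargs


# Intended usage (requires torch; `dataset` is a placeholder):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, batch_size=None, **loader_kwargs(4))
```

With spawn, each worker starts a fresh interpreter instead of inheriting the parent's state, so the dataset and any collate function must be picklable, and the script entry point should be guarded with `if __name__ == "__main__":`.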

If this solution is approved, I will submit a PR. Let me know if you have any suggestions.

Mddct commented Apr 27, 2024

PR welcome.

MengqingCao added a commit to MengqingCao/wenet that referenced this issue Apr 28, 2024
xingchensong added a commit that referenced this issue May 8, 2024
MengqingCao added a commit to MengqingCao/wenet that referenced this issue May 15, 2024
MengqingCao added a commit to MengqingCao/wenet that referenced this issue May 16, 2024
  - fix segmentfault in Kunpeng (wenet-e2e#2506)
  - avoids the repeated initialization of deepspeed in (wenet-e2e#2507)
MengqingCao added a commit to MengqingCao/wenet that referenced this issue May 16, 2024
  - fix segmentfault in Kunpeng (wenet-e2e#2506)
  - avoids the repeated initialization of deepspeed causing by (wenet-e2e#2507)
xingchensong pushed a commit that referenced this issue May 17, 2024
  - fix segmentfault in Kunpeng (#2506)
  - avoids the repeated initialization of deepspeed causing by (#2507)