RuntimeError: Specified device cuda:0 does not match device of data cuda:-2 #33

EmreOzkose · 2021-09-02T11:56:07Z

Hello,

I am training a TDNN-LSTM model with the librispeech recipe on 100 hours of 16k data. After training, I run decode.py. I sometimes observe a CUDA issue (given below). Have you ever observed something like that? I think it is related to something during training, because after some training runs decode.py works well, but after others decode.py gives this error. I googled the error RuntimeError: Specified device cuda:0 does not match device of data cuda:-2, but found nothing. I have a Tesla P100 (16 GB). I should also mention that 1best works well; the problem occurs only during nbest and rescoring.

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 9
2021-09-02 14:24:46,677 INFO [decode.py:324] Decoding started
2021-09-02 14:24:46,678 INFO [decode.py:325] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 1, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest-rescoring', 'num_paths': 10, 'epoch': 9, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 14:24:47,880 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 14:24:48,469 INFO [decode.py:334] device: cuda:0
2021-09-02 14:25:02,211 INFO [decode.py:362] Loading pre-compiled G_4_gram.pt
2021-09-02 14:25:02,846 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-9.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 14:25:07,886 INFO [decode.py:271] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 432, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 415, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 250, in decode_dataset
    hyps_dict = decode_one_batch(
  File "tdnn_lstm_ctc/decode.py", line 190, in decode_one_batch
    best_path_dict = rescore_with_n_best_list(
  File "/path/to/k2/icefall/icefall/decode.py", line 405, in rescore_with_n_best_list
    am_scores, _ = compute_am_and_lm_scores(
  File "/path/to/k2/icefall/icefall/decode.py", line 297, in compute_am_and_lm_scores
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f41692162f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f416921367b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f40c8316200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f40c83fc0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f40c8372bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f40c837658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f40c838d876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f40c830bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f41c016d41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #54: __libc_start_main + 0xe7 (0x7f41f24cbb97 in /lib/x86_64-linux-gnu/libc.so.6)


EmreOzkose · 2021-09-02T12:12:02Z

Note that it could also be a memory issue, because I have limited GPU memory (16 GB). However, if the problem were a memory issue, I would expect to see an error like:

RuntimeError: CUDA out of memory. Tried to allocate 420.00 MiB (GPU 0; 15.90 GiB total capacity; 3.23 GiB already allocated; 168.75 MiB free; 3.56 GiB reserved in total by PyTorch)

danpovey · 2021-09-02T12:13:33Z

Perhaps it's trying to use >1 GPU somehow? (But it shouldn't). If that's the case, setting something like CUDA_VISIBLE_DEVICES=0 (or whatever) should address it. Another possibility is that cuda:-2 is not a real device but some kind of error code. That error message likely comes from torch. I think it would be worthwhile to try to catch the error in pdb, and print out the devices of all inputs to the function that failed. Once we know which object has the bad device, we can more easily debug.
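
The device check suggested here could be scripted roughly as follows. This is only a sketch: `report_devices` is a hypothetical helper, not part of k2 or icefall, and it assumes that tensor-like inputs expose a `.device` attribute while plain Python values (such as a float default) do not.

```python
import os

# Assumption: this runs before torch/k2 are imported, so that the process
# only ever sees GPU 0. setdefault keeps any value already set in the shell.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")


def report_devices(**named_inputs):
    """Map each input name to its device string, or None if it has no .device."""
    report = {}
    for name, value in named_inputs.items():
        device = getattr(value, "device", None)
        report[name] = None if device is None else str(device)
    return report
```

Calling `print(report_devices(src=src, index=index, default_value=default_value))` just before the failing call would list every input in one line, which makes an odd device easier to spot than printing them one by one.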

csukuangfj · 2021-09-02T12:15:44Z

 File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)

Could you modify /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py, line 66, as follows:

print(src.device, index.device)
return _k2.index_select(src, index, default_value)

It may show something that is useful.

EmreOzkose · 2021-09-02T12:17:28Z

@csukuangfj I already printed the devices before, but all of them were cuda:0.

EmreOzkose · 2021-09-02T12:18:21Z

@danpovey I have 4 devices, but I set CUDA_VISIBLE_DEVICES=0 before training. I will also try to debug with pdb.

EmreOzkose · 2021-09-02T12:48:13Z

I added a try-except block to the function decode_one_batch() in decode.py:

try:
    best_path = nbest_decoding(
        lattice=lattice,
        num_paths=params.num_paths,
        use_double_scores=params.use_double_scores,
    )
except:
    breakpoint()

When I run python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8:

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(3)<module>()
-> import os
(Pdb) c
2021-09-02 15:43:01,990 INFO [decode.py:330] Decoding started
2021-09-02 15:43:01,990 INFO [decode.py:331] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 15:43:02,604 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 15:43:02,963 INFO [decode.py:340] device: cuda:0
2021-09-02 15:43:09,784 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 15:43:11,389 INFO [decode.py:277] batch 0, cuts processed until now is 1/171 (0.584795%)
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(185)decode_one_batch()
-> key = f"no_rescore-{params.num_paths}"
(Pdb) lattice.device
device(type='cuda', index=0)
(Pdb)

The problem occurs in nbest_decoding(). Only the lattice is passed to that function, and its device is cuda:0.

danpovey · 2021-09-02T13:06:42Z

I think you are not quite at the place where it failed. Maybe you need to do "c" (continue)?

EmreOzkose · 2021-09-02T13:36:18Z

Without the try-except block, the log is:

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python -m pdb tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
> /path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py(3)<module>()
-> import os
(Pdb) c
2021-09-02 16:33:33,700 INFO [decode.py:327] Decoding started
2021-09-02 16:33:33,701 INFO [decode.py:328] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 16:33:34,178 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 16:33:34,494 INFO [decode.py:337] device: cuda:0
2021-09-02 16:33:45,349 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 16:33:47,481 INFO [decode.py:274] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
    pdb._runscript(mainpyfile)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
    self.run(statement)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 3, in <module>
    import os
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
    results_dict = decode_dataset(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
    hyps_dict = decode_one_batch(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
    best_path = nbest_decoding(
  File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f359b5e32f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f359b5e067b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f34fa699200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f34fa77f0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f34fa6f5bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f34fa6f958f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f34fa710876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f34fa68efcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f35f253a41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(66)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) lattice.device
*** NameError: name 'lattice' is not defined
(Pdb)

I can't reach lattice after the error, hence I added the try-except block.
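
One way to reach a variable such as lattice after the exception, without a bare try-except around the call, is to walk the traceback back to the frame that defines it. The sketch below uses only the standard library; `find_local`, `_inner`, and `_outer` are hypothetical names for illustration, and the two small functions merely stand in for the real decoding call chain.

```python
def find_local(tb, name):
    """Return the value of `name` from the innermost traceback frame defining it."""
    found = False
    value = None
    # A traceback is a linked list running from the catching frame (outermost)
    # to the raising frame (innermost); later matches overwrite earlier ones.
    while tb is not None:
        frame_locals = tb.tb_frame.f_locals
        if name in frame_locals:
            found, value = True, frame_locals[name]
        tb = tb.tb_next
    if not found:
        raise KeyError(name)
    return value


def _inner():
    raise RuntimeError("device mismatch")  # stand-in for the failing k2 call


def _outer():
    lattice = "stand-in lattice"  # stand-in for the real lattice object
    _inner()


try:
    _outer()
except RuntimeError as exc:
    recovered = find_local(exc.__traceback__, "lattice")
```

In an interactive post-mortem session the same effect is available with pdb's `up` command, which moves to the caller's frame; repeating it until the frame that owns lattice is reached makes `lattice.device` available again.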

EmreOzkose · 2021-09-02T13:54:03Z

I added a breakpoint at the place @csukuangfj suggested. The log is here:

(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
Traceback (most recent call last):
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
    pdb._runscript(mainpyfile)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
    self.run(statement)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 435, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
    results_dict = decode_dataset(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
    hyps_dict = decode_one_batch(
  File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
    best_path = nbest_decoding(
  File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 161, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 67, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe9a54c82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fe9a54c567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7fe904576200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7fe90465c0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7fe9045d2bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7fe9045d658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7fe9045ed876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7fe90456bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7fe9fc41f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb)

The relevant place in miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py:

65: breakpoint()
66: return _k2.index_select(src, index, default_value)

danpovey · 2021-09-02T14:02:21Z

It might be possible to catch the exception in gdb by doing:

gdb --args python3 whatever.py
(gdb) catch throw
(gdb) r
...

danpovey · 2021-09-02T14:02:38Z

... running with a debug version of k2 would help, there, though.

csukuangfj · 2021-09-02T14:11:05Z

https://k2.readthedocs.io/en/latest/installation/for_developers.html

The above link contains instructions to build a debug version of k2.

csukuangfj · 2021-09-02T14:12:23Z

> I added breakpoint to place where @csukuangfj said. Log is here:

Could you also print the shape of src and index?

print(src.shape)
print(index.shape)

to verify that neither of them is empty?
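A minimal sketch of that check, in case a single helper is easier to drop in than separate print calls. The helper name `describe` and the stand-in objects are hypothetical and not part of k2 or icefall; the only assumption is that the argument exposes `.shape` and `.device`, as a PyTorch tensor does. The stand-ins are used here only so the snippet runs without a GPU.

```python
from math import prod
from types import SimpleNamespace


def describe(name, t):
    """Return a one-line summary of a tensor-like object.

    Only `.shape` and `.device` are read, so this works for a
    torch.Tensor as well as for the stand-ins below.
    """
    shape = tuple(t.shape)
    # An object is empty when any dimension is 0; prod(()) == 1 for scalars.
    numel = prod(shape)
    return f"{name}: device={t.device} shape={shape} empty={numel == 0}"


# Hypothetical stand-ins; in ops.py one would pass the real src and index.
src = SimpleNamespace(device="cuda:0", shape=(5,))
index = SimpleNamespace(device="cuda:0", shape=(0,))

print(describe("src", src))
print(describe("index", index))
```

Placed just before the `_k2.index_select(src, index, default_value)` call, a line with `empty=True` would point at the tensor to investigate.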

EmreOzkose · 2021-09-03T05:41:04Z

I checked whether index or src is empty and found that index is empty when the problem occurs.

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 8
2021-09-03 08:14:46,220 INFO [decode.py:327] Decoding started
2021-09-03 08:14:46,220 INFO [decode.py:328] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest', 'num_paths': 30, 'max_frames': 1000, 'epoch': 8, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-03 08:14:46,837 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-03 08:14:47,150 INFO [decode.py:337] device: cuda:0
2021-09-03 08:14:55,636 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-8.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([729618])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([562]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([729618]) torch.Size([15309908])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([15309908]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([106588]) torch.Size([106588])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
2021-09-03 08:14:57,654 INFO [decode.py:274] batch 0, cuts processed until now is 1/171 (0.584795%)
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([1375261])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([2322]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([1375261]) torch.Size([36303965])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([36303965]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([178240]) torch.Size([178240])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([749184])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([1308]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([749184]) torch.Size([21094213])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([21094213]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([101191]) torch.Size([101191])
cuda:0 cuda:0
torch.Size([30]) torch.Size([1])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([36466453]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([183094])
cuda:0 cuda:0
torch.Size([183094]) torch.Size([0])
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 435, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 418, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
    hyps_dict = decode_one_batch(
  File "tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
    best_path = nbest_decoding(
  File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 163, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 69, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f42803f82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f42803f567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f41df4f8200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f41df5de0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f41df554bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f41df55858f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f41df56f876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f41df4edfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f42d734f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #52: __libc_start_main + 0xe7 (0x7f43096adb97 in /lib/x86_64-linux-gnu/libc.so.6)
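With this much debug output, the empty tensor is easy to miss by eye. The following is a rough sketch, assuming the log has been saved as text and uses the `torch.Size([...])` format printed above; the function name `first_empty_shape_line` is made up for illustration.

```python
import re

# Matches the dimension list inside "torch.Size([...])".
SIZE_RE = re.compile(r"torch\.Size\(\[([0-9, ]*)\]\)")


def first_empty_shape_line(log_text):
    """Return (line_number, line) of the first line that prints a shape
    containing a zero-sized dimension, or None if there is none."""
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for dims in SIZE_RE.findall(line):
            sizes = [int(d) for d in dims.split(",") if d.strip()]
            if 0 in sizes:
                return lineno, line.strip()
    return None


# A short excerpt in the same format as the log above.
sample = (
    "cuda:0 cuda:0\n"
    "torch.Size([183094]) torch.Size([183094])\n"
    "cuda:0 cuda:0\n"
    "torch.Size([183094]) torch.Size([0])\n"
)
print(first_empty_shape_line(sample))
```

The line number is relative to the text passed in, so it only locates the entry within the saved log, not within the source code.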

csukuangfj · 2021-09-03T05:56:10Z

@EmreOzkose
Could you show us the version of k2 you are using?

$ python3 -m k2.version

should print that information.

EmreOzkose · 2021-09-03T06:09:49Z

@csukuangfj
My version info is:

Collecting environment information...

k2 version: 1.3
Build type: Release
Git SHA1: 6b8a10fa95213da285b8fce6525b2c5ed42198a6
Git date: Tue Aug 3 05:36:48 2021
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.5
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 16.04.7 LTS
CMake version: 3.18.4
GCC version: 5.5.0
CMAKE_CUDA_FLAGS:  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

I think I understand the issue. I am trying different architectures and features. Since my memory is limited, I have to decrease max_frames when I increase the number of layers in the model. With a small number of frames (like 5000), index is empty for some batches.

csukuangfj · 2021-09-03T06:24:55Z

I recommend updating your k2.

k2 v1.6 contains several bug fixes, including the one you are facing, I think.
As you are using conda, steps to update k2 are fairly simple. Please see
https://k2.readthedocs.io/en/latest/installation/conda.html
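To check whether an environment is older than v1.6, the report format shown earlier in this thread can be parsed. This is only a sketch: the function name `parse_k2_version` is hypothetical, and it assumes the report contains a `k2 version:` line like the one posted above (for example, text captured from `python3 -m k2.version`).

```python
import re


def parse_k2_version(report):
    """Extract the numeric version from a 'k2 version: X.Y' line.

    Returns a tuple of ints such as (1, 3), or None if no such line exists.
    """
    m = re.search(r"^k2 version:\s*(\d+(?:\.\d+)*)", report, flags=re.MULTILINE)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))


# Excerpt of the report posted in this thread.
report = """Collecting environment information...

k2 version: 1.3
Build type: Release
"""

installed = parse_k2_version(report)
# Tuples compare element by element, so (1, 3) < (1, 6).
needs_update = installed is not None and installed < (1, 6)
print(installed, needs_update)
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "1.10" would sort before "1.6".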

EmreOzkose · 2021-09-03T06:26:30Z

Thank you so much! I am updating at once.

EmreOzkose · 2021-09-03T06:57:11Z

I want to report back here. I updated k2 and ran decode.py again. The problem no longer occurs, thank you. However, the hyps are coming out empty :). From here on, it is a problem with my design :).

EmreOzkose closed this as completed Sep 3, 2021

EmreOzkose reopened this Sep 3, 2021

EmreOzkose closed this as completed Sep 3, 2021

danpovey mentioned this issue Nov 27, 2021

Decoding error 'Fsa' object doesn't support assignment. #133

Open

ahazned mentioned this issue Apr 13, 2022

Illegal memory error when training with multi-GPU #247

Open
