Crash at exit #164

Closed
albertz opened this issue Dec 18, 2023 · 6 comments · Fixed by #232 or #234

Comments

@albertz
Member

albertz commented Dec 18, 2023

Sometimes, but not always (maybe 20% of the cases?), when I hit Ctrl+C, I get this crash:

^C[2023-12-18 18:53:21,090] INFO: Got user interrupt signal stop engine and exit
[2023-12-18 18:53:21,090] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140176269506112)>}
[2023-12-18 18:53:21,665] ERROR: Exception in thread <DummyProcess(Thread-12 (worker), started daemon 140175636158016)>:
[2023-12-18 18:53:21,666] ERROR: Exception in thread <DummyProcess(Thread-18 (worker), started daemon 140175107679808)>:
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-14 (worker), started daemon 140175619372608)>:
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-7 (worker), started daemon 140176156243520)>:
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-6 (worker), started daemon 140176164636224)>:
[2023-12-18 18:53:21,776] ERROR: Exception in thread <DummyProcess(Thread-15 (worker), started daemon 140175610979904)>:
[2023-12-18 18:53:21,817] ERROR: Exception in thread <DummyProcess(Thread-3 (worker), started daemon 140176189814336)>:
[2023-12-18 18:53:21,858] ERROR: Exception in thread <DummyProcess(Thread-9 (worker), started daemon 140176139458112)>:
[2023-12-18 18:53:21,858] ERROR: Exception in thread <DummyProcess(Thread-4 (worker), started daemon 140176181421632)>:
[2023-12-18 18:53:21,859] ERROR: Exception in thread <DummyProcess(Thread-13 (worker), started daemon 140175627765312)>:
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
EXCEPTION
Traceback (most recent call last):
[2023-12-18 18:53:21,859] ERROR: Exception in thread <DummyProcess(Thread-11 (worker), started daemon 140175644550720)>:
EXCEPTION
Traceback (most recent call last):
EXCEPTION
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 570, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 547, in SISGraph.for_all_nodes.<locals>.runner
EXCEPTION
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
EXCEPTION
EXCEPTION
Traceback (most recent call last):
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
EXCEPTION
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 570, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
EXCEPTION
Traceback (most recent call last):
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
(Exclude vars because we are exiting.) 
...
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
    line: raise ValueError("Pool not running")
ValueError: Pool not running
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
Exception ignored in atexit callback: <function shutdown at 0x7f7d659ae5c0>
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
    line: self._check_running()
EXCEPTION                                                                                                                                               
Traceback (most recent call last):                                                                                                                      
EXCEPTION                                                                                                                                               
Traceback (most recent call last):                                                                                                                      
(Exclude vars because we are exiting.)                                                                                                                  
    line: raise ValueError("Pool not running")
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
Exception ignored in sys.unraisablehook: <built-in function unraisablehook>
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
KeyboardInterrupt
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x00007f7d668932d8)                                                                                            
                                                                                                                                                        
Current thread 0x00007f7d66080000 (most recent call first):                                                                                             
  <no Python frame>                                                                                                                                     
                                                                                                                                                        
Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend (total: 41)
fish: Job 2, '/work/tools/users/zeyer/py-envs…' terminated by signal SIGABRT (Abort)     
@albertz
Member Author

albertz commented Dec 18, 2023

The scrambled output means that there are many processes here stopped at the same time by SIGINT.

@critias
Contributor

critias commented Jan 2, 2024

The graph computations are using a ThreadPool (https://github.com/rwth-i6/sisyphus/blob/master/sisyphus/graph.py#L232C12-L232C12).
I guess you get this output if you hit Ctrl-C while these computations are running. This problem might go away if you set gs.GRAPH_WORKER=1, but you would also lose the multithreading speedup, which matters if your filesystem has a high latency.
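
To illustrate the trade-off, here is a minimal sketch that is not Sisyphus code. `fake_stat` is a hypothetical stand-in for a filesystem check on a slow mount, and the latency and worker counts are arbitrary assumptions. It only shows why a thread pool helps with latency-bound work and what a single worker gives up.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_stat(path, latency=0.05):
    """Hypothetical stand-in for a filesystem check on a high-latency mount."""
    time.sleep(latency)  # sleeping releases the GIL, much like blocking I/O does
    return path


def timed_map(num_workers, paths):
    """Run fake_stat over all paths with the given pool size and return the seconds taken."""
    start = time.perf_counter()
    with ThreadPool(num_workers) as pool:
        pool.map(fake_stat, paths)
    return time.perf_counter() - start


paths = [f"job_{i}/output" for i in range(8)]
serial_time = timed_map(1, paths)    # comparable to a single graph worker
parallel_time = timed_map(8, paths)  # all eight checks wait concurrently
print(f"1 worker: {serial_time:.2f}s, 8 workers: {parallel_time:.2f}s")
```

With one worker the waits add up; with several workers they overlap, which is the speedup that a single worker would give away.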

@albertz
Member Author

albertz commented Jan 2, 2024

Are you saying that GRAPH_WORKER=1 is always better anyway, and that we can remove the old code which handles GRAPH_WORKER>1?

I'm not searching for workarounds. Also, I could simply just ignore this message.

I simply report this because I think it's bad if the process crashes with "terminated by signal SIGABRT", and maybe this should be investigated further.

@critias
Contributor

critias commented Jan 2, 2024

No, I'm not saying GRAPH_WORKER=1 is better, it's just a workaround which in most cases makes sisyphus slower.

@albertz
Member Author

albertz commented Dec 16, 2024

I get this now very frequently (I don't remember the last time that Sisyphus quit without this error). Even at normal exit:

...
[2024-12-14 07:11:19,901] INFO: There is nothing I can do, good bye!                                                                          
[2024-12-14 07:11:19,945] ERROR: Exception in thread <DummyProcess(Thread-36 (worker), started daemon 140326949860928)>:                      
[2024-12-14 07:11:20,060] ERROR: Exception in thread <DummyProcess(Thread-39 (worker), started daemon 140326446560832)>:                      
[2024-12-14 07:11:20,138] ERROR: Exception in thread <DummyProcess(Thread-38 (worker), started daemon 140326933075520)>:                      
[2024-12-14 07:11:20,139] ERROR: Exception in thread <DummyProcess(Thread-37 (worker), started daemon 140326941468224)>:                      
[2024-12-14 07:11:20,370] ERROR: Exception in thread <DummyProcess(Thread-40 (worker), started daemon 140326438168128)>:                      
EXCEPTION                                                                                                                                     
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
EXCEPTION
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)                                                                                                        
    locals:                                                                                                                                   
      func = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>                                               
      args = <local> (Job<work/i6_core/returnn/search/SearchWordsDummyTimesToCTMJob.VIMLbcW4JZak>,)
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper     
    line: runner(path.creator)                                                                                                                
EXCEPTION                                                                                                                                     
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
EXCEPTION
    line: return func(*args, **kwargs)
    locals:
Traceback (most recent call last):
    line: return func(*args, **kwargs)
      func = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
      args = <local> (Job<work/i6_core/returnn/search/SearchWordsDummyTimesToCTMJob.SG7NWmv1IGUk>,)
      kwargs = <local> {}
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
    locals:
      runner = <local> <function SISGraph.for_all_nodes.<locals>.runner at 0x7fa031935ee0>
      path = <local> <Path /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/search/SearchTakeBestJob.LBaWONwJQbAo/output/best_search_results.py.gz>
EXCEPTION
Traceback (most recent call last):
    locals:
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
    locals:
      runner = <local> <function SISGraph.for_all_nodes.<locals>.runner at 0x7fa031935ee0>
      path.creator = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.LBaWONwJQbAo>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
      path = <local> <Path /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/search/SearchTakeBestJob.fSgCrTuB3AKa/output/best_search_results.py.gz>
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
    locals:
      path.creator = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.fSgCrTuB3AKa>
      func = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
      args = <local> (Job<work/i6_core/returnn/search/SearchWordsDummyTimesToCTMJob.tt8AD04eEtv6>,)
      visited = <local> {'i6_experiments/users/zeyer/recog/GetBestRecogTrainExp.BZritgYEgoDE': <multiprocessing.pool.ApplyResult object at 0x7fa0302aec10>, 'i6_experiments/users/zeyer/datasets/score_results/JoinScoreResultsJob.SzY46iK1xF1G': <multiprocessing.pool.ApplyResult object at 0x7fa032bdab50>, 'i6_experiments/us..., len = 15735
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
      sis_id = <local> 'i6_core/returnn/search/SearchTakeBestJob.LBaWONwJQbAo', len = 53
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
      pool = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
      kwargs = <local> {}
    locals:
      func = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
    locals:
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
      pool.apply_async = <local> <bound method Pool.apply_async of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
      args = <local> (Job<work/i6_core/returnn/search/SearchWordsDummyTimesToCTMJob.21nktv9fWgnO>,)
    line: runner(path.creator)
      tools = <global> <module 'sisyphus.tools' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py'>
    locals:
      func = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
      visited = <local> {'i6_experiments/users/zeyer/recog/GetBestRecogTrainExp.BZritgYEgoDE': <multiprocessing.pool.ApplyResult object at 0x7fa0302aec10>, 'i6_experiments/users/zeyer/datasets/score_results/JoinScoreResultsJob.SzY46iK1xF1G': <multiprocessing.pool.ApplyResult object at 0x7fa032bdab50>, 'i6_experiments/us..., len = 15735
      kwargs = <local> {}
      args = <local> (Job<work/i6_core/returnn/search/SearchWordsDummyTimesToCTMJob.9vsJ45Z27Hd3>,)
    locals:
      tools.default_handle_exception_interrupt_main_thread = <global> <function default_handle_exception_interrupt_main_thread at 0x7fa1190e20c0>
      sis_id = <local> 'i6_core/returnn/search/SearchTakeBestJob.fSgCrTuB3AKa', len = 53
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
      runner_helper = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
      pool = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
    line: runner(path.creator)
      job = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.LBaWONwJQbAo>
      pool.apply_async = <local> <bound method Pool.apply_async of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
    locals:
      runner = <local> <function SISGraph.for_all_nodes.<locals>.runner at 0x7fa031935ee0>
      kwargs = <local> {}
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
    line: self._check_running()
      path = <local> <Path /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/search/SearchTakeBestJob.GhRjEqlCh70Z/output/best_search_results.py.gz>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
    locals:
      path.creator = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.GhRjEqlCh70Z>
    line: runner(path.creator)
      self = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
      runner = <local> <function SISGraph.for_all_nodes.<locals>.runner at 0x7fa031935ee0>
    locals:
      self._check_running = <local> <bound method Pool._check_running of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
      path = <local> <Path /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/search/SearchTakeBestJob.v4UPalgUJ67C/output/best_search_results.py.gz>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
      runner = <local> <function SISGraph.for_all_nodes.<locals>.runner at 0x7fa031935ee0>
      path = <local> <Path /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/search/SearchTakeBestJob.nSpfsxY9BoqY/output/best_search_results.py.gz>
    line: raise ValueError("Pool not running")
      tools = <global> <module 'sisyphus.tools' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py'>
      path.creator = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.nSpfsxY9BoqY>
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
      path.creator = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.v4UPalgUJ67C>
    locals:
      tools.default_handle_exception_interrupt_main_thread = <global> <function default_handle_exception_interrupt_main_thread at 0x7fa1190e20c0>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
      ValueError = <builtin> <class 'ValueError'>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
      runner_helper = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
    locals:
    locals:
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
      visited = <local> {'i6_experiments/users/zeyer/recog/GetBestRecogTrainExp.BZritgYEgoDE': <multiprocessing.pool.ApplyResult object at 0x7fa0302aec10>, 'i6_experiments/users/zeyer/datasets/score_results/JoinScoreResultsJob.SzY46iK1xF1G': <multiprocessing.pool.ApplyResult object at 0x7fa032bdab50>, 'i6_experiments/us..., len = 15735
      visited = <local> {'i6_experiments/users/zeyer/recog/GetBestRecogTrainExp.BZritgYEgoDE': <multiprocessing.pool.ApplyResult object at 0x7fa0302aec10>, 'i6_experiments/users/zeyer/datasets/score_results/JoinScoreResultsJob.SzY46iK1xF1G': <multiprocessing.pool.ApplyResult object at 0x7fa032bdab50>, 'i6_experiments/us..., len = 15735
ValueError: Pool not running
    locals:
      sis_id = <local> 'i6_core/returnn/search/SearchTakeBestJob.GhRjEqlCh70Z', len = 53
      sis_id = <local> 'i6_core/returnn/search/SearchTakeBestJob.nSpfsxY9BoqY', len = 53
      job = <local> Job<work/i6_core/returnn/search/SearchTakeBestJob.fSgCrTuB3AKa>
      visited = <local> {'i6_experiments/users/zeyer/recog/GetBestRecogTrainExp.BZritgYEgoDE': <multiprocessing.pool.ApplyResult object at 0x7fa0302aec10>, 'i6_experiments/users/zeyer/datasets/score_results/JoinScoreResultsJob.SzY46iK1xF1G': <multiprocessing.pool.ApplyResult object at 0x7fa032bdab50>, 'i6_experiments/us..., len = 15735
      pool = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
[2024-12-14 07:11:21,966] INFO: Got user interrupt signal stop engine and exit
    line: self._check_running()
      sis_id = <local> 'i6_core/returnn/search/SearchTakeBestJob.v4UPalgUJ67C', len = 53
    locals:
      pool.apply_async = <local> <bound method Pool.apply_async of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
      pool = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
      pool.apply_async = <local> <bound method Pool.apply_async of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
      tools = <global> <module 'sisyphus.tools' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py'>
      pool = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
      tools = <global> <module 'sisyphus.tools' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py'>
      tools.default_handle_exception_interrupt_main_thread = <global> <function default_handle_exception_interrupt_main_thread at 0x7fa1190e20c0>
      pool.apply_async = <local> <bound method Pool.apply_async of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
      self = <local> <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>
      self._check_running = <local> <bound method Pool._check_running of <multiprocessing.pool.ThreadPool state=CLOSE pool_size=5>>
      runner_helper = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
      tools.default_handle_exception_interrupt_main_thread = <global> <function default_handle_exception_interrupt_main_thread at 0x7fa1190e20c0>
      runner_helper = <local> <function SISGraph.for_all_nodes.<locals>.runner_helper at 0x7fa031936660>
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x00007fa11a2ef2d8)

Current thread 0x00007fa119adc000 (most recent call first):
  <no Python frame>

Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend, sentencepiece._sentencepiece (total: 42)
fish: Job 2, '/work/tools/users/zeyer/py-envs…' terminated by signal SIGABRT (Abort)

Or when interrupting:

...
[2024-12-16 07:50:34,490] INFO: running(2) waiting(59)
^C[2024-12-16 07:50:57,545] INFO: Got user interrupt signal stop engine and exit
[2024-12-16 07:50:57,545] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140311858968128)>}
[2024-12-16 07:50:57,572] ERROR: Exception in thread <DummyProcess(Thread-36 (worker), started daemon 140309617940032)>:
[2024-12-16 07:50:57,675] ERROR: Exception in thread <DummyProcess(Thread-39 (worker), started daemon 140309115688512)>:
[2024-12-16 07:50:57,676] ERROR: Exception in thread <DummyProcess(Thread-38 (worker), started daemon 140309124081216)>:
[2024-12-16 07:50:57,746] ERROR: Exception in thread <DummyProcess(Thread-37 (worker), started daemon 140309132473920)>:
[2024-12-16 07:50:57,887] ERROR: Exception in thread <DummyProcess(Thread-40 (worker), started daemon 140309107295808)>:
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
    line: raise ValueError("Pool not running")
ValueError: Pool not running
EXCEPTION
Exception ignored in atexit callback: <function _exit_function at 0x7f9d0836e160>
Traceback (most recent call last):
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/util.py", line 334, in _exit_function
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
    line: raise ValueError("Pool not running")
ValueError: Pool not running
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
EXCEPTION
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
    line: raise ValueError("Pool not running")
ValueError: Pool not running
Traceback (most recent call last):
EXCEPTION
Traceback (most recent call last):
KeyboardInterrupt: 
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 582, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 559, in SISGraph.for_all_nodes.<locals>.runner
(Exclude vars because we are exiting.)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 303, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
    line: visited[sis_id] = pool.apply_async(
              tools.default_handle_exception_interrupt_main_thread(runner_helper), (job,)
          )
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async
    line: self._check_running()
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x00007f9d0981d2d8)

Current thread 0x00007f9d0900a000 (most recent call first):
  <no Python frame>

Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend, sentencepiece._sentencepiece (total: 42)
fish: Job 2, '/work/tools/users/zeyer/py-envs…' terminated by signal SIGABRT (Abort)

albertz changed the title from "Crash after user interrupt" to "Crash at exit" on Dec 16, 2024
@albertz
Member Author

albertz commented Dec 16, 2024

I looked a bit at the code. There are multiple problems:

I assume some background thread is currently iterating through the graph (SISGraph.for_all_nodes). This is probably the JobCleaner.

This is how the pool was created:

            self._pool = ThreadPool(gs.GRAPH_WORKER)
            atexit.register(self._pool.close)

Similarly, the JobCleaner also calls self.thread_pool.close().

So I guess this atexit handler gets called, and this then triggers all the ValueError: Pool not running errors. I don't really see a clean way to catch this ValueError: Pool not running. So I think we should always properly check our own managed stopped/exited attribute. (E.g., in RETURNN, at exit, we set the global sys.exited = True, and then check for that. Maybe not the cleanest way, but it works.)
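
A minimal sketch of this failure mode and of the flag idea, using only the standard library. The names `exiting`, `_mark_exiting` and `submit` are hypothetical and not taken from Sisyphus or RETURNN. The shutdown is simulated by calling the handler directly.

```python
import atexit
from multiprocessing.pool import ThreadPool

pool = ThreadPool(2)
exiting = False  # hypothetical flag, playing the role of a "stopped/exited" attribute


def _mark_exiting():
    """Set the flag first, then close the pool (closing twice is harmless)."""
    global exiting
    exiting = True
    pool.close()


atexit.register(_mark_exiting)


def submit(func, *args):
    """Submit work unless shutdown has started; return None if the work was skipped."""
    if exiting:
        return None
    return pool.apply_async(func, args)


# Normal operation: the task runs.
first_value = submit(pow, 2, 5).get(timeout=5)

# Simulate what the atexit handler does at interpreter shutdown.
_mark_exiting()

# An unguarded call on the closed pool raises the error from the tracebacks.
try:
    pool.apply_async(pow, (2, 5))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

skipped = submit(pow, 2, 5)  # the guarded call is skipped instead of raising
pool.join()
```

The check and the submission are not atomic here, so a thread could still pass the check just before the pool is closed. A lock around both steps would be needed to rule that out.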

Further, the default_handle_exception_interrupt_main_thread is hammering the main thread with SIGINT. I'm not sure this is causing any problems.

But then also, the spamming of errors is unnecessary. In all code, before we do any pool.apply_async, we should check whether we can still do that.
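
One possible shape for such a check, as a sketch under assumptions: `GuardedPool` and `try_apply_async` are hypothetical names, not part of Sisyphus or of `multiprocessing`. The lock makes the check and the submission one atomic step, so a worker thread cannot submit between the flag being set and the pool being closed.

```python
import threading
from multiprocessing.pool import ThreadPool


class GuardedPool:
    """Hypothetical wrapper that refuses new work once close() was requested."""

    def __init__(self, num_workers):
        self._pool = ThreadPool(num_workers)
        self._lock = threading.Lock()
        self._closed = False

    def try_apply_async(self, func, args=()):
        """Return an AsyncResult, or None if the pool no longer accepts work."""
        with self._lock:
            if self._closed:
                return None
            return self._pool.apply_async(func, args)

    def close(self):
        """Stop accepting work; tasks that are already queued still finish."""
        with self._lock:
            self._closed = True
            self._pool.close()

    def join(self):
        """Wait for the worker threads to finish."""
        self._pool.join()


guarded_pool = GuardedPool(2)
value = guarded_pool.try_apply_async(sum, ([1, 2, 3],)).get(timeout=5)
guarded_pool.close()
rejected = guarded_pool.try_apply_async(sum, ([1, 2, 3],))  # None, no exception
guarded_pool.join()
```

A caller that gets `None` back can stop its traversal quietly instead of logging a traceback per worker thread.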
