FileExistsError in Job._sis_setup_directory #161

Open
albertz opened this issue Dec 5, 2023 · 4 comments

@albertz
Member

albertz commented Dec 5, 2023

...
[2023-12-05 04:29:29,576] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>
[2023-12-05 04:29:29,576] INFO: interrupted_resumable(1) retry_error(4) running(8) waiting(663)                                                  
[2023-12-05 04:31:04,825] ERROR: Exception in thread <_MainThread(MainThread, started 140708433694720)>:                                         
EXCEPTION                                                                                                                                        
Traceback (most recent call last):
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)                                                                                                           
    locals:                                                             
      func = <local> <function Manager.run at 0x7ff93ae65940>                                                                                    
      args = <local> (<Manager(Thread-2, initial)>,)                                                                                             
      kwargs = <local> {}                                                                                                                        
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 617, in Manager.run                   
    line: self.resume_jobs()                                                                                                                     
    locals:                                                              
      self = <local> <Manager(Thread-2, initial)>                                                                                                
      self.resume_jobs = <local> <bound method Manager.resume_jobs of <Manager(Thread-2, initial)>>                                              
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 405, in Manager.resume_jobs                                
    line: self.thread_pool.map(f, self.jobs.get(gs.STATE_INTERRUPTED_RESUMABLE, []))
    locals:                                                                                                                                      
      self = <local> <Manager(Thread-2, initial)>
      self.thread_pool = <local> <multiprocessing.pool.ThreadPool state=RUN pool_size=10>                                                        
      self.thread_pool.map = <local> <bound method Pool.map of <multiprocessing.pool.ThreadPool state=RUN pool_size=10>>                         
      f = <local> <function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>                                                                    
      self.jobs = <local> defaultdict(<class 'set'>, {'waiting': {Job<work/i6_core/returnn/search/SearchWordsToCTMJob.sHh83NBWaNtR>, Job<work/i6_core/recognition/scoring/ScliteJob.TThbUgE8qjSd>, Job<work/i6_core/returnn/forward/ReturnnForwardJobV2.uX3OywkERCr7>, Job<work/i6_core/returnn/search/SearchRemoveLabelJob.cakEqUo..., len = 5, _[0]: {len = 0}
      self.jobs.get = <local> <built-in method get of collections.defaultdict object at 0x7ff8ac589c60>                                          
      gs = <global> <module 'sisyphus.global_settings' from '/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/global_settings.py'>    
      gs.STATE_INTERRUPTED_RESUMABLE = <global> 'interrupted_resumable', len = 21                                                                
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 367, in Pool.map
    line: return self._map_async(func, iterable, mapstar, chunksize).get()                                                                       
    locals:                                                                                                                                      
      self = <local> <multiprocessing.pool.ThreadPool state=RUN pool_size=10>                                                                    
      self._map_async = <local> <bound method Pool._map_async of <multiprocessing.pool.ThreadPool state=RUN pool_size=10>>                       
      func = <local> <function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>                                                                 
      iterable = <local> {Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>}, len = 1
      mapstar = <global> <function mapstar at 0x7ff93b797d80>                                                                                    
      chunksize = <local> None                                                                                                                   
      get = <not found>                                                 
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 774, in ApplyResult.get
    line: raise self._value
    locals:
      self = <local> <multiprocessing.pool.MapResult object at 0x7ff8ac510a90>
      self._value = <local> FileExistsError(17, 'File exists')
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    line: result = (True, func(*args, **kwds))
    locals:
      result = <local> None
      func = <local> None
      args = <local> None
      kwds = <local> None
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    line: return list(map(*args))
    locals:
      list = <builtin> <class 'list'>
      map = <builtin> <class 'map'>
      args = <local> (<function Manager.resume_jobs.<locals>.f at 0x7ff92c3c0180>, (Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>,))
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/manager.py", line 399, in Manager.resume_jobs.<locals>.f
    line: job._sis_setup_directory(force=True)
    locals:
      job = <local> Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>
      job._sis_setup_directory = <local> <bound method Job._sis_setup_directory of Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6/train work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol>>
      force = <not found>
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/job.py", line 284, in Job._sis_setup_directory
    line: os.symlink(src=os.path.abspath(str(creator._sis_path())), dst=link_name, target_is_directory=True)
    locals:
      os = <global> <module 'os' (frozen)>
      os.symlink = <global> <built-in function symlink>
      src = <not found>
      os.path = <global> <module 'posixpath' (frozen)>
      os.path.abspath = <global> <function abspath at 0x7ff93be74f40>
      str = <builtin> <class 'str'>
      creator = <local> Job<work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt>
      creator._sis_path = <local> <bound method Job._sis_path of Job<work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt>>
      dst = <not found>
      link_name = <local> 'work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol/input/i6_core_text_label_subword_nmt_train_ReturnnTrainBpeJob.vTq56NZ8STWt', len = 136
      target_is_directory = <not found>
FileExistsError: [Errno 17] File exists: '/u/zeyer/setups/combined/2021-05-31/work/i6_core/text/label/subword_nmt/train/ReturnnTrainBpeJob.vTq56NZ8STWt' -> 'work/i6_core/returnn/training/ReturnnTrainingJob.jyQaF3P8Ieol/input/i6_core_text_label_subword_nmt_train_ReturnnTrainBpeJob.vTq56NZ8STWt'
[2023-12-05 04:31:05,077] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140708182750784)>}

This is the first time I have seen this. It is probably a very rare issue.

After a restart of the manager, I don't see the problem anymore.
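
For reference, the last frame can be reproduced outside Sisyphus: `os.symlink` raises `FileExistsError` (errno 17) whenever the link path already exists. This is only a minimal sketch of that behavior, not a diagnosis of what happened here; the directory and link names are hypothetical stand-ins created in a temporary directory.

```python
import errno
import os
import tempfile


def symlink_twice(base_dir):
    """Create the same symlink twice and return the errno of the second attempt."""
    # Hypothetical stand-ins for the creator job directory and the input link.
    target = os.path.join(base_dir, "creator_job")
    link_name = os.path.join(base_dir, "input_link")
    os.mkdir(target)
    os.symlink(src=target, dst=link_name, target_is_directory=True)
    try:
        # The link already exists, so the second call fails.
        os.symlink(src=target, dst=link_name, target_is_directory=True)
    except FileExistsError as exc:
        return exc.errno
    return None


with tempfile.TemporaryDirectory() as tmp:
    second_errno = symlink_twice(tmp)
    print(second_errno == errno.EEXIST)
```

The same error also occurs when the existing link is dangling: `os.path.exists` follows symlinks and reports `False` for a broken link, while `os.symlink` still refuses to overwrite it.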

@michelwi
Contributor

michelwi commented Dec 5, 2023

Mh.. I have never seen this before.

Do you have two managers running simultaneously that race to create the same work folder?

@albertz
Member Author

albertz commented Dec 5, 2023

> Do you have two managers running simultaneously that race to create the same work folder?

No.

@Atticus1806
Contributor

Atticus1806 commented Dec 5, 2023

I have seen it before, but I am no longer sure what caused it. It might have been during the period of filesystem problems on asr3, but I can't tell for sure.

@critias
Contributor

critias commented Jan 2, 2024

My first guess would also have been multiple managers or a filesystem problem. The function should not be called in parallel inside Sisyphus for the same job. It is called here:

self.thread_pool.map(lambda job: job._sis_setup_directory(), self.jobs.get(gs.STATE_HOLD, []))

Let us know if this problem reappears.
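
If it does reappear, one possible hardening would be to make the link creation idempotent, so that an already existing link to the same target is accepted instead of raising. The following is only a sketch under that assumption and is not how Sisyphus currently creates its input links; `ensure_symlink` is a hypothetical helper and the paths are placeholders in a temporary directory.

```python
import os
import tempfile


def ensure_symlink(src, link_name):
    """Create link_name -> src, tolerating a link that already points to src.

    Hypothetical helper for illustration, not part of Sisyphus.
    """
    try:
        os.symlink(src, link_name, target_is_directory=True)
    except FileExistsError:
        # Something created the path first (another thread, another manager,
        # or an earlier run). Accept it only if it is a symlink to the same
        # target; anything else is a real conflict and is re-raised.
        if not os.path.islink(link_name) or os.readlink(link_name) != src:
            raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "creator_job")
    link = os.path.join(tmp, "input_link")
    os.mkdir(target)
    ensure_symlink(target, link)
    ensure_symlink(target, link)  # second call is a no-op instead of raising
    link_ok = os.readlink(link) == target
    print(link_ok)
```

Catching the exception rather than checking for existence first avoids a check-then-create window, and comparing `os.readlink` with the intended target keeps a stale or wrong link from being silently accepted.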
