Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileCache assertion on previous copy attempt age triggered #1643

Closed
NeoLegends opened this issue Nov 9, 2024 · 11 comments
Closed

FileCache assertion on previous copy attempt age triggered #1643

NeoLegends opened this issue Nov 9, 2024 · 11 comments
Assignees

Comments

@NeoLegends
Copy link
Collaborator

Waiting for lock-file: /ssd/[...]/returnn/file_cache/nas/data/[...]/data[...].lock
Waiting for lock-file: /ssd/[...]/returnn/file_cache/nas/data/[...]/data[...].lock
Waiting for lock-file: /ssd/[...]/returnn/file_cache/nas/data/[...]/data[...].lock
Removing old lockfile '/ssd/[...]/returnn/file_cache/nas/data/[...]/data[...].lock' (probably crashed proc).

[...]

  File "/nas/models/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.SdBYnl2SGErb/output/returnn/returnn/util/file_cache.py", line 294, in FileCache.handle_cached_files_in_config.<locals>._handle_value
    line: res = self.get_file(value.filename)
    locals:
      res = <not found>
      self = <local> <returnn.util.file_cache.FileCache object at 0x7efe22723b80>
      self.get_file = <local> <bound method FileCache.get_file of <returnn.util.file_cache.FileCache object at 0x7efe22723b80>>
      value = <local> CachedFile(filename='/nas/data/[...]/data[...].hdf')
      value.filename = <local> '/nas/data/[...]/data[...].hdf', len = 100
  File "/nas/models/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.SdBYnl2SGErb/output/returnn/returnn/util/file_cache.py", line 120, in FileCache.get_file
    line: self._copy_file_if_needed(src_filename, dst_filename)
    locals:
      self = <local> <returnn.util.file_cache.FileCache object at 0x7efe22723b80>
      self._copy_file_if_needed = <local> <bound method FileCache._copy_file_if_needed of <returnn.util.file_cache.FileCache object at 0x7efe22723b80>>
      src_filename = <local> '/nas/data/[...]/data[...].hdf', len = 100
      dst_filename = <local> '/ssd/[...]/returnn/file_cache/nas/data/[...]/data[...].hdf', len = 138
  File "/nas/models/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.SdBYnl2SGErb/output/returnn/returnn/util/file_cache.py", line 346, in FileCache._copy_file_if_needed
    line: assert dst_tmp_file_age > self._lock_timeout * 0.8, (
              f"FileCache: Expected left-over temp copy file {dst_tmp_filename}"
              f" from crashed previous copy attempt"
              f" to be older than {self._lock_timeout * 0.8:.1f}s but it is {dst_tmp_file_age} seconds old"
          )
    locals:
      dst_tmp_file_age = <local> 0.3813760280609131
      self = <local> <returnn.util.file_cache.FileCache object at 0x7efe22723b80>
      self._lock_timeout = <local> 20
AssertionError: FileCache: Expected left-over temp copy file /ssd/[...]/returnn/file_cache/nas/data/[...]/data[...].hdf.copy from crashed previous copy attempt to be older than 16.0s but it is 0.3813760280609131 seconds old

We also get:

File "/nas/models/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.SdBYnl2SGErb/output/returnn/returnn/util/file_cache.py", line 479, in run
    os.utime(filename, None)
FileNotFoundError: [Errno 2] No such file or directory

One process acquires the lock file but then seems to fail to update the mtime frequently enough for the others to wait. Another process then deletes the lock file and starts copying by itself as well, leading to the assertion triggering because the other process is still copying the data.

This is probably due to FS slowness. I'm not sure what to best do here. We can

  1. plainly increase the lock timeout
  2. measure the time FS operations take and adjust the timeout based on that (seems difficult to do properly)

In any case we can add the lock file to the touch files thread immediately after acquiring it so that the gap between acquiring the lock and trying to update its mtime is as small as possible and includes no FS operations in between that could cause additional delays.

@albertz
Copy link
Member

albertz commented Nov 11, 2024

After #1644 was merged now, please check whether this problem still occurs.

@NeoLegends
Copy link
Collaborator Author

This has not occured again on the same problematic node.

@NeoLegends
Copy link
Collaborator Author

NeoLegends commented Dec 6, 2024

Another instance of this error has occured w/ RETURNN 8441d07, on a node where the local disk is close to full:

EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fe058881a20>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 743, in main
    line: init(command_line_options=argv[1:])
    locals:
      init = <global> <function init at 0x7fe058881750>
      command_line_options = <not found>
      argv = <local> ['/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/rnn.py', '/nas/models/asr/am/[...]/work/i6_core/returnn/training/ReturnnTrainingJob.e1YDy57rgK2b/output/retur..., _[0]: {len = 148}
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 486, in init
    line: init_data()
    locals:
      init_data = <global> <function init_data at 0x7fe0588813f0>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 218, in init_data
    line: train_data, extra_train = load_data(config, train_cache_bytes, "train")
    locals:
      train_data = <global> None
      extra_train = <not found>
      load_data = <global> <function load_data at 0x7fe058881360>
      config = <global> <returnn.config.Config object at 0x7fe05898feb0>
      train_cache_bytes = <local> 94489280513
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 190, in load_data
    line: data = init_dataset(kwargs)
    locals:
      data = <not found>
      init_dataset = <global> <function init_dataset at 0x7fe058852440>
      kwargs = <local> {'name': 'train', 'class': 'PostprocessingDataset', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1...., len = 7
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/datasets/basic.py", line 1511, in init_dataset
    line: kwargs, cached_files = cache.handle_cached_files_in_config(kwargs)
    locals:
      kwargs = <local> {'name': 'train', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.spli..., len = 6
      cached_files = <not found>
      cache = <local> <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>
      cache.handle_cached_files_in_config = <local> <bound method FileCache.handle_cached_files_in_config of <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 299, in FileCache.handle_cached_files_in_config
    line: return tree.map_structure(_handle_value, config), res_files
    locals:
      tree = <local> <module 'tree' from '/usr/local/lib/python3.10/dist-packages/tree/__init__.py'>
      tree.map_structure = <local> <function map_structure at 0x7fdf3e932b90>
      _handle_value = <local> <function FileCache.handle_cached_files_in_config.<locals>._handle_value at 0x7fdf33b35510>
      config = <local> {'name': 'train', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.spli..., len = 6
      res_files = <local> ['/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.split-151/batch.1.v1.hdf.1', '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.split-151/batch.1.v1.hdf.2', '/ssd/pzheng..., len = 498, _[0]: {len = 138}
  File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 435, in map_structure
    line: return unflatten_as(structures[0],
                              [func(*args) for args in zip(*map(flatten, structures))])
    locals:
      unflatten_as = <global> <function unflatten_as at 0x7fdf3e932b00>
      structures = <local> ({'name': 'train', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.spl..., _[0]: {len = 6}
      func = <local> <function FileCache.handle_cached_files_in_config.<locals>._handle_value at 0x7fdf33b35510>
      args = <not found>
      zip = <builtin> <class 'zip'>
      map = <builtin> <class 'map'>
      flatten = <global> <function flatten at 0x7fdf3e932830>
  File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 435, in <listcomp>
    line: return unflatten_as(structures[0],
                              [func(*args) for args in zip(*map(flatten, structures))])
    locals:
      unflatten_as = <global> <function unflatten_as at 0x7fdf3e932b00>
      structures = <not found>
      func = <local> <function FileCache.handle_cached_files_in_config.<locals>._handle_value at 0x7fdf33b35510>
      args = <local> (CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76'),)
      zip = <builtin> <class 'zip'>
      map = <builtin> <class 'map'>
      flatten = <global> <function flatten at 0x7fdf3e932830>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 294, in FileCache.handle_cached_files_in_config.<locals>._handle_value
    line: res = self.get_file(value.filename)
    locals:
      res = <not found>
      self = <local> <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>
      self.get_file = <local> <bound method FileCache.get_file of <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>>
      value = <local> CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76')
      value.filename = <local> '/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76', len = 100
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 120, in FileCache.get_file
    line: self._copy_file_if_needed(src_filename, dst_filename)
    locals:
      self = <local> <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>
      self._copy_file_if_needed = <local> <bound method FileCache._copy_file_if_needed of <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>>
      src_filename = <local> '/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76', len = 100
      dst_filename = <local> '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76', len = 130
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 346, in FileCache._copy_file_if_needed
    line: assert dst_tmp_file_age > self._lock_timeout * 0.8, (
              f"FileCache: Expected left-over temp copy file {dst_tmp_filename}"
              f" from crashed previous copy attempt"
              f" to be older than {self._lock_timeout * 0.8:.1f}s but it is {dst_tmp_file_age} seconds old"
          )
    locals:
      dst_tmp_file_age = <local> 0.12685394287109375
      self = <local> <returnn.util.file_cache.FileCache object at 0x7fdf33b2dff0>
      self._lock_timeout = <local> 20
AssertionError: FileCache: Expected left-over temp copy file /ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76.copy from crashed previous copy attempt to be older than 16.0s but it is 0.12685394287109375 seconds old

Furthermore, again it seems a lock file was removed by another process:

EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fe7c19e5a20>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 743, in main
    line: init(command_line_options=argv[1:])
    locals:
      init = <global> <function init at 0x7fe7c19e5750>
      command_line_options = <not found>
      argv = <local> ['/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/rnn.py', '/nas/models/asr/am/[...]/work/i6_core/returnn/training/ReturnnTrainingJob.e1YDy57rgK2b/output/retur..., _[0]: {len = 148}
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 486, in init
    line: init_data()
    locals:
      init_data = <global> <function init_data at 0x7fe7c19e53f0>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 218, in init_data
    line: train_data, extra_train = load_data(config, train_cache_bytes, "train")
    locals:
      train_data = <global> None
      extra_train = <not found>
      load_data = <global> <function load_data at 0x7fe7c19e5360>
      config = <global> <returnn.config.Config object at 0x7fe7c1affeb0>
      train_cache_bytes = <local> 94489280513
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/__main__.py", line 190, in load_data
    line: data = init_dataset(kwargs)
    locals:
      data = <not found>
      init_dataset = <global> <function init_dataset at 0x7fe7c19be440>
      kwargs = <local> {'name': 'train', 'class': 'PostprocessingDataset', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1...., len = 7
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/datasets/basic.py", line 1511, in init_dataset
    line: kwargs, cached_files = cache.handle_cached_files_in_config(kwargs)
    locals:
      kwargs = <local> {'name': 'train', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.spli..., len = 6
      cached_files = <not found>
      cache = <local> <returnn.util.file_cache.FileCache object at 0x7fe6a04d1ff0>
      cache.handle_cached_files_in_config = <local> <bound method FileCache.handle_cached_files_in_config of <returnn.util.file_cache.FileCache object at 0x7fe6a04d1ff0>>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 299, in FileCache.handle_cached_files_in_config
    line: return tree.map_structure(_handle_value, config), res_files
    locals:
      tree = <local> <module 'tree' from '/usr/local/lib/python3.10/dist-packages/tree/__init__.py'>
      tree.map_structure = <local> <function map_structure at 0x7fe6a7a96b90>
      _handle_value = <local> <function FileCache.handle_cached_files_in_config.<locals>._handle_value at 0x7fe6a04d9510>
      config = <local> {'name': 'train', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.spli..., len = 6
      res_files = <local> ['/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.split-151/batch.1.v1.hdf.1', '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.split-151/batch.1.v1.hdf.2', '/ssd/pzheng..., len = 498, _[0]: {len = 138}
  File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 435, in map_structure
    line: return unflatten_as(structures[0],
                              [func(*args) for args in zip(*map(flatten, structures))])
    locals:
      unflatten_as = <global> <function unflatten_as at 0x7fe6a7a96b00>
      structures = <local> ({'name': 'train', 'dataset': {'class': 'MetaDataset', 'data_map': {'classes': ('alignments', 'data'), 'data': ('features', 'data')}, 'datasets': {'alignments': {'class': 'HDFDataset', 'files': [CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-gmm_20241204.alignment.spl..., _[0]: {len = 6}
      func = <local> <function FileCache.handle_cached_files_in_config.<locals>._handle_value at 0x7fe6a04d9510>
      args = <not found>
      zip = <builtin> <class 'zip'>
      map = <builtin> <class 'map'>
      flatten = <global> <function flatten at 0x7fe6a7a96830>
  File "/usr/local/lib/python3.10/dist-packages/tree/__init__.py", line 435, in <listcomp>
    line: return unflatten_as(structures[0],
                              [func(*args) for args in zip(*map(flatten, structures))])
    locals:
      unflatten_as = <global> <function unflatten_as at 0x7fe6a7a96b00>
      structures = <not found>
      func = <local> <function FileCache.handle_cached_files_in_config.<locals>._handle_value at 0x7fe6a04d9510>
      args = <local> (CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76'),)
      zip = <builtin> <class 'zip'>
      map = <builtin> <class 'map'>
      flatten = <global> <function flatten at 0x7fe6a7a96830>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 294, in FileCache.handle_cached_files_in_config.<locals>._handle_value
    line: res = self.get_file(value.filename)
    locals:
      res = <not found>
      self = <local> <returnn.util.file_cache.FileCache object at 0x7fe6a04d1ff0>
      self.get_file = <local> <bound method FileCache.get_file of <returnn.util.file_cache.FileCache object at 0x7fe6a04d1ff0>>
      value = <local> CachedFile(filename='/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76')
      value.filename = <local> '/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76', len = 100
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 126, in FileCache.get_file
    line: raise e
    locals:
      e = <not found>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 120, in FileCache.get_file
    line: self._copy_file_if_needed(src_filename, dst_filename)
    locals:
      self = <not found>
      self._copy_file_if_needed = <not found>
      src_filename = <not found>
      dst_filename = <not found>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 328, in FileCache._copy_file_if_needed
    line: with LockFile(
              directory=dst_dir, name=os.path.basename(dst_filename) + ".lock", lock_timeout=self._lock_timeout
          ) as lock, self._touch_files_thread.files_added_context(lock.lockfile):
    locals:
      LockFile = <global> <class 'returnn.util.basic.LockFile'>
      directory = <not found>
      dst_dir = <local> '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151', len = 108
      name = <not found>
      os = <global> <module 'os' from '/usr/lib/python3.10/os.py'>
      os.path = <global> <module 'posixpath' from '/usr/lib/python3.10/posixpath.py'>
      os.path.basename = <global> <function basename at 0x7fe7ebb4e9e0>
      dst_filename = <local> '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76', len = 130
      lock_timeout = <not found>
      self = <local> <returnn.util.file_cache.FileCache object at 0x7fe6a04d1ff0>
      self._lock_timeout = <local> 20
      lock = <local> <returnn.util.basic.LockFile object at 0x7fe6a052c730>
      self._touch_files_thread = <local> <_TouchFilesThread(Thread-4, stopped daemon 140627891185216)>
      self._touch_files_thread.files_added_context = <local> <bound method _TouchFilesThread.files_added_context of <_TouchFilesThread(Thread-4, stopped daemon 140627891185216)>>
      lock.lockfile = <local> '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76.lock', len = 135
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/basic.py", line 3300, in LockFile.__exit__
    line: self.unlock()
    locals:
      self = <local> <returnn.util.basic.LockFile object at 0x7fe6a052c730>
      self.unlock = <local> <bound method LockFile.unlock of <returnn.util.basic.LockFile object at 0x7fe6a052c730>>
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/basic.py", line 3293, in LockFile.unlock
    line: os.remove(self.lockfile)
    locals:
      os = <global> <module 'os' from '/usr/lib/python3.10/os.py'>
      os.remove = <global> <built-in function remove>
      self = <local> <returnn.util.basic.LockFile object at 0x7fe6a052c730>
      self.lockfile = <local> '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76.lock', len = 135
FileNotFoundError: [Errno 2] No such file or directory: '/ssd/pzheng/returnn/file_cache/nas/data/speech/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76.lock'

@NeoLegends NeoLegends reopened this Dec 6, 2024
@albertz
Copy link
Member

albertz commented Dec 6, 2024

Do we log when we remove the lock file? E.g. log why it is being removed, the recent mtime, etc.

@NeoLegends
Copy link
Collaborator Author

NeoLegends commented Dec 6, 2024

Yes:

Removing old lockfile '/ssd/pzheng/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/batch.1.v1.hdf.76.lock' (probably crashed proc).

We don't log the mtime, but we have set the lock timeout in the file cache to 20s by default.

@NeoLegends
Copy link
Collaborator Author

NeoLegends commented Dec 6, 2024

WDYT on adding another check to LockFile that tries to figure out if the lock file is currently open/held by another process, instead of relying on the mtime? Do you think this can be reliable at all? E.g. there is https://psutil.readthedocs.io/en/latest/#psutil.Process.open_files. I'd have to check how this behaves for non-child processes and processes of other users (will likely fail there), but given that we customize the cache path on a per-user basis looking at the users own processes will be enough.

@NeoLegends
Copy link
Collaborator Author

See #1663

@albertz
Copy link
Member

albertz commented Dec 6, 2024

WDYT on adding another check to LockFile that tries to figure out if the lock file is currently open/held by another process, instead of relying on the mtime? Do you think this can be reliable at all?

I don't like this.

But also, I don't really understand the issue. How does this actually happen? Why is the mtime not reliable? How can that be? And we should be able to make this reliable, or not? So then, no other weird workaround is needed.

@NeoLegends
Copy link
Collaborator Author

NeoLegends commented Dec 6, 2024

A (very) slow FS could be one reason the mtime is unreliable, no? In that case I don't know if there is anything we can do to make it reliable besides increasing the lock timeout. And then this decreases the usability in the good case, where there isn't any FS slowness but an actually crashed process that has left behind stale lockfiles.

@albertz
Copy link
Member

albertz commented Dec 6, 2024

A (very) slow FS could be one reason the mtime is unreliable, no?

But we are writing to a local disk here? I expect maybe that this is busy for a few milliseconds, not more. This cannot explain that mtime is unreliable.

Maybe some weird FS flags while mounting the disk could cause this? But I don't think so.

So, I highly doubt that this is the problem. To me it looks like some bug somewhere. Before we add any further hacks/workarounds, we should understand what goes wrong here.

@NeoLegends
Copy link
Collaborator Author

NeoLegends commented Dec 12, 2024

Another part of the error message is

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
FileCache: using existing file /ssd/[...]/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-151/NPO-batch.1.v1.hdf.20
  File "/nas/models/asr/am/[...]/work/i6_core/tools/git/CloneGitRepositoryJob.vEYVF21o1uRS/output/returnn/returnn/util/file_cache.py", line 475, in run
    for filename in self.files:
RuntimeError: dictionary changed size during iteration

I think this is because the _TouchFilesThread runs I/O operations while iterating the self.files dict, and I believe the GIL is released during these I/O operations, permitting the main thread to modify self.files in the mean time.

This is probably the issue here. The touch files thread crashes and no longer updates the mtimes. 🤷🏼‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants