Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc pull: shows wrong error & does not continue when remote is missing .dir content #10111

Closed
mdekstrand opened this issue Nov 25, 2023 · 3 comments · Fixed by #10215
Closed
Assignees
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p1-important Important, aka current backlog of things to do research

Comments

@mdekstrand
Copy link

mdekstrand commented Nov 25, 2023

Bug Report

Description

When a tracked directory has not been pushed to a remote, attempting to pull from that remote fails with an incorrect error message (it reports that the local data is missing, not that the remote data is missing), and fails immediately upon encountering the missing data instead of continuing to fetch available data and summarizing unfetchable data at the end of the run like it (historically) does with missing file content.

Reproduce

  1. mkdir repo1
  2. cd repo1
  3. git init
  4. dvc init
  5. git commit -am 'initial'
  6. dvc import-url https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
  7. dvc stage add -n unpack-ml -d ml-latest-small.zip -o ml-latest-small unzip ml-latest-small.zip
  8. dvc repro
  9. git add .
  10. git commit -m run
  11. cd ..
  12. git clone repo1 repo2
  13. cd repo2
  14. dvc remote add --local local ../store
  15. dvc pull -r local

Actual Results

Collecting                                                                                                                                                     |2.00 [00:00, 53.2entry/s]
Fetching
ERROR: unexpected error - failed to load directory ('7a', '07225112ccf8cdc73da1f2d56610d0.dir'): [Errno 2] No such file or directory: '/home/mde48/x/repo2/.dvc/cache/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Running with --verbose shows the original error:

dvc pull -r local --verbose
2023-11-25 17:10:41,667 DEBUG: v3.30.1 (conda), CPython 3.10.13 on Linux-5.15.0-88-generic-x86_64-with-glibc2.35
2023-11-25 17:10:41,667 DEBUG: command: /home/mde48/miniforge3/envs/bookdata/bin/dvc pull -r local --verbose
Collecting                                                                                                                                                     |0.00 [00:00,    ?entry/s]
2023-11-25 17:10:42,212 DEBUG: failed to load ('7a', '07225112ccf8cdc73da1f2d56610d0.dir') from storage local (/home/mde48/x/store/files/md5) - [Errno 2] No such file or directory: '/home/mde48/x/store/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'
Traceback (most recent call last):
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 138, in open
    return open(path, mode=mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/home/mde48/x/store/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'

2023-11-25 17:10:42,212 DEBUG: failed to load ('7a', '07225112ccf8cdc73da1f2d56610d0.dir') from storage local (/home/mde48/x/repo2/.dvc/cache/files/md5) - [Errno 2] No such file or directory: '/home/mde48/x/repo2/.dvc/cache/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'
Traceback (most recent call last):
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 138, in open
    return open(path, mode=mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/home/mde48/x/repo2/.dvc/cache/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'

Fetching
2023-11-25 17:10:42,235 ERROR: unexpected error - failed to load directory ('7a', '07225112ccf8cdc73da1f2d56610d0.dir'): [Errno 2] No such file or directory: '/home/mde48/x/repo2/.dvc/cache/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'
Traceback (most recent call last):
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 138, in open
    return open(path, mode=mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/home/mde48/x/repo2/.dvc/cache/files/md5/7a/07225112ccf8cdc73da1f2d56610d0.dir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 35, in run
    stats = self.repo.pull(
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/repo/__init__.py", line 60, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/repo/pull.py", line 30, in pull
    processed_files_count = self.fetch(
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/repo/__init__.py", line 60, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc/repo/fetch.py", line 183, in fetch
    fetch_transferred, fetch_failed = ifetch(
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/fetch.py", line 65, in fetch
    [
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/fetch.py", line 65, in <listcomp>
    [
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 725, in iteritems
    self._load(key, entry)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 681, in _load
    self.onerror(entry, exc)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 609, in _onerror
    raise exc
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 679, in _load
    _load_from_storage(self._trie, entry, storage_info)
  File "/home/mde48/miniforge3/envs/bookdata/lib/python3.10/site-packages/dvc_data/index/index.py", line 597, in _load_from_storage
    raise DataIndexDirError(f"failed to load directory {entry.key}") from last_exc
dvc_data.index.index.DataIndexDirError: failed to load directory ('7a', '07225112ccf8cdc73da1f2d56610d0.dir')

2023-11-25 17:10:42,319 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-11-25 17:10:42,319 DEBUG: Removing '/home/mde48/x/.NzE8Z5My2DHxS3Atqo8jZB.tmp'
2023-11-25 17:10:42,319 DEBUG: Removing '/home/mde48/x/.NzE8Z5My2DHxS3Atqo8jZB.tmp'
2023-11-25 17:10:42,319 DEBUG: Removing '/home/mde48/x/.NzE8Z5My2DHxS3Atqo8jZB.tmp'
2023-11-25 17:10:42,319 DEBUG: Removing '/home/mde48/x/repo2/.dvc/cache/files/md5/.Ct4xEso9xvhHYoq6xEG5US.tmp'
2023-11-25 17:10:42,322 DEBUG: Version info for developers:
DVC version: 3.30.1 (conda)
---------------------------
Platform: Python 3.10.13 on Linux-5.15.0-88-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 2.22.0
	dvc_objects = 1.2.0
	dvc_render = 0.6.0
	dvc_task = 0.3.0
	scmrepo = 1.5.0
Supports:
	http (aiohttp = 3.9.0, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.0, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.10.0, boto3 = 1.28.64),
	webdav (webdav4 = 0.9.8),
	webdavs (webdav4 = 0.9.8)
Config:
	Global: /home/mde48/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Caches: local
Remotes: local
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/d8473e8b99bed31ceae24f6a64ffc840

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-11-25 17:10:42,323 DEBUG: Analytics is enabled.
2023-11-25 17:10:42,359 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpq0l1dbh7', '-v']
2023-11-25 17:10:42,365 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpq0l1dbh7', '-v'] with pid 331244

Expected

  • The error message to show the true error (the hashed content was not available in the remote), instead of the local error (a missing file in the local cache — of course it isn't in the local cache, that's why I'm trying to fetch)
  • DVC to continue trying to fetch other data from the remote when available (in a larger project, dvc pull fails as soon as this error is encountered, instead of continuing to fetch available data and summarizing the unfetchable data at the end, even when --allow-missing is provided)

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.30.1 (conda)
---------------------------
Platform: Python 3.10.13 on Linux-5.15.0-88-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 2.22.0
	dvc_objects = 1.2.0
	dvc_render = 0.6.0
	dvc_task = 0.3.0
	scmrepo = 1.5.0
Supports:
	http (aiohttp = 3.9.0, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.0, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.10.0, boto3 = 1.28.64),
	webdav (webdav4 = 0.9.8),
	webdavs (webdav4 = 0.9.8)
Config:
	Global: /home/mde48/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Caches: local
Remotes: local
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/d8473e8b99bed31ceae24f6a64ffc840

Additional Information (if any):

This specific bug seems like it could easily cause more serious repository-wedging problems in the following unfortunate scenario:

  1. A large job finishes.
  2. Results are committed to git and pushed to git remote.
  3. Compute storage fails before the dvc content is fully pushed to the DVC remote; in particular, before the .dir record is pushed.

In this case, it will be impossible to dvc pull the current repository to get the pieces that are available in order to fill in the rest with another dvc repro, without doing surgery to remove the dvc.lock entries that reference the unfetchable directory content. One could argue that pushing to git before dvc is bad practice, but a situation where the dvc repository can no longer even try to fetch content because the only copy of the fetched content is on a dead disk seems suboptimal.

Finally, a broader context comment — while upgrading several projects from DVC 2.x or early 3.x to recent (3.28 and later?) DVC, it appears to me that the quality of DVC error messages, and robustness to storage errors, has gone down remarkably; from the stack traces, it looks like errors often happen at the new dvc-data layer, but propagate upwards in a way that causes the user-facing layer to misattribute their cause in the primary error message (for example, reporting an “unexpected error” in this case for a subclass of the condition of a remote not having all the data dvc expects to fetch). See also #10076, for example, or #10065 (which was not quite the same problem but also seemed to be a species of errors manifesting as unrelated-looking problems, at least in its user-visible effects). This is very frustrating, and it also makes it harder for me to teach others how to use DVC when the error messages do not accurately and actionably report the errors encountered. It also seems to be a serious regression from earlier versions of DVC, where storage errors were more directly reported and the software recovered more reliably (in my experience, at least, using it primarily in a non-cloud environment).

@dberenbaum dberenbaum added bug Did we break something? p1-important Important, aka current backlog of things to do labels Dec 1, 2023
@dberenbaum dberenbaum added this to DVC Dec 1, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Dec 1, 2023
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Dec 1, 2023
@dberenbaum dberenbaum added the A: data-sync Related to dvc get/fetch/import/pull/push label Dec 1, 2023
@kbumsik
Copy link

kbumsik commented Dec 28, 2023

Hi, I came across the same issue.

It took me hours to figure out why dvc pull shows the same failed to load directory error. Showing misleading error message really is a bad problem.

Also, I agree with his comment. It was just 3 weeks ago to came across another bug that broke dvc #10142. It is great that dvc is releasing new versions at least every week, but dvc needs to have a better regression test.

@efiop efiop added the research label Jan 2, 2024
@efiop efiop self-assigned this Jan 2, 2024
@efiop
Copy link
Contributor

efiop commented Jan 2, 2024

For the record: can reproduce. It happens because we try to dynamically load collected index, which we shouldn't need to do at that point. Working on a fix...

@efiop
Copy link
Contributor

efiop commented Jan 2, 2024

Finally, a broader context comment — while upgrading several projects from DVC 2.x or early 3.x to recent (3.28 and later?) DVC, it appears to me that the quality of DVC error messages, and robustness to storage errors, has gone down remarkably; from the stack traces, it looks like errors often happen at the new dvc-data layer, but propagate upwards in a way that causes the user-facing layer to misattribute their cause in the primary error message (for example, reporting an “unexpected error” in this case for a subclass of the condition of a remote not having all the data dvc expects to fetch). See also #10076, for example, or #10065 (which was not quite the same problem but also seemed to be a species of errors manifesting as unrelated-looking problems, at least in its user-visible effects). This is very frustrating, and it also makes it harder for me to teach others how to use DVC when the error messages do not accurately and actionably report the errors encountered. It also seems to be a serious regression from earlier versions of DVC, where storage errors were more directly reported and the software recovered more reliably (in my experience, at least, using it primarily in a non-cloud environment).

Thanks for the feedback, we truly appreciate it. We've accumulated a lot of things over the years that worked well enough in simple scenarios but broke down or was unmanageable in more complex ones (e.g. like cloud versioning support or imports), which resulted in us revisiting the architecture around 3.0 to make things more robust in the mid/longterm. That also includes better messages for multi-stage operations. Unfortunately, the way we've handled it with the resources available resulted in regressions that we are still fixing, not to mention that ui/ux is yet to be cleaned up, even though it now has a more robust architecture behind it that is a foundation for any other improvements.

I am personally really sorry that we've let the quality/stability drop and we are trying hard to improve things while also trying to deliver more advanced features like dvcfs that really benefit from the more robust architecture that we have today.

@mdekstrand @kbumsik the feedback you folks provide is essential to us, thank you so much for helping improve dvc.

@github-project-automation github-project-automation bot moved this from Todo to Done in DVC Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p1-important Important, aka current backlog of things to do research
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants