Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: skip non-cache looking files when estimating remote size #3557

Merged
merged 3 commits into from
Mar 31, 2020

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Mar 31, 2020

dvc push and other commands will fail if they find a file that doesn't look like a cache file in the remote. The issue is that we've started using list_cache_paths for remote size estimation, which doesn't skip non-cache looking files. We should use all() instead.

Discord context: https://discordapp.com/channels/485586884165107732/485596304961962003/694454999492591657

  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop efiop requested a review from pmrowla March 31, 2020 08:21
@@ -697,17 +697,18 @@ def checksum_to_path_info(self, checksum):
def list_cache_paths(self, prefix=None, progress_callback=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much a walk_files method, but it is not supported in every remote yet, so we are using a legacy implementation for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using walk_files would've been much more intuitive, esp since walk presumes that it is an iterator, unlike list_cache_paths which could be anything.

@efiop efiop changed the title Discord manuel remote: skip non-cache looking files when estimating remote size Mar 31, 2020
@efiop
Copy link
Contributor Author

efiop commented Mar 31, 2020

Oops, I think I forgot to change the places that expect cache paths but get hashes. Looking into it.

`list_cache_paths()` doesn't ignore paths that don't look like hashes,
which makes dvc raise an exception when it tries to convert such paths
to hashes. We should use `all()` instead.

Discord context:
https://discordapp.com/channels/485586884165107732/485596304961962003/694454282719592458
This will allow us to see it from verbose logs coming from users.
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix the bug for now.

In the changes for #3543 it will be cleaned up a bit more to keep the separation between all() returning all checksums in the remote and list_cache_paths() allowing filtering by prefix

@@ -936,13 +935,13 @@ def update(n=1):
pbar.update(n * total_prefixes)

if max_remote_size:
paths = self._cache_paths_with_max(
checksums = self._all_with_limit(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe simply islice()?

@efiop efiop merged commit 38e818b into iterative:master Mar 31, 2020
@efiop efiop deleted the discord-manuel branch March 31, 2020 09:10
@efiop efiop self-assigned this Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants