Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: short-circuit remote size estimation for large remotes #3537

Merged
merged 1 commit into from
Mar 30, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Mar 26, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Will close #3530

@pmrowla pmrowla self-assigned this Mar 26, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 26, 2020

Blocked by #3532
This PR will need rebase after 3532 as they both touch the same code (but I think it makes sense to keep the UI and size estimation logic changes in separate PRs)

The estimation behavior changes also won't be obvious without the size estimation pbar from the UI PR

@pmrowla pmrowla force-pushed the 3530 branch 2 times, most recently from dba4c53 to 7cefeea Compare March 30, 2020 05:52
@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 30, 2020

Short-circuit point (which depends on the # of local checksums) is now used as the upper bound for the Estimating size of ... progress bar. We would stop estimating at a size of 10M files in this example:

asciicast

@pmrowla pmrowla marked this pull request as ready for review March 30, 2020 06:15
@pmrowla pmrowla changed the title [WIP] remote: short-circuit remote size estimation for large remotes remote: short-circuit remote size estimation for large remotes Mar 30, 2020
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸš€

@efiop efiop merged commit d276ba4 into iterative:master Mar 30, 2020
prefix = "0" * self.TRAVERSE_PREFIX_LEN
total_prefixes = pow(16, self.TRAVERSE_PREFIX_LEN)
if short_circuit:
max_remote_size = self._max_estimation_size(checksums)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: name is a bit confusing - probably should be max_threshold or something ... it's not remote size, right?

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-merge LGTM

else:
max_remote_size = None

with Tqdm(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2cs:

  • let's check please GDrive (other remotes?) that it is indeed lazy (does not prefetch all pages for the prefix)
  • short circuitry itself is great, but I would probably keep UI as it was before ... it became even more confusing - when I see 1% - 2% -> done ... let's imagine we will have it counting to 20% and I users would be scared to see it since they would expect to wait until 100%?
  • code wise, not this PR - size of this file got out of hands already .... not sure what is the best practice in Python (are there?), how people support large files in a good shape? or split into helpers, mixins, etc? cc @efiop @Suor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein It's lazy for gdrive. The only remote I wasn't sure about is hdfs. For hdfs we yield each item returned by hdfs.ls(<prefix>), but I don't think there's much else we can do in that case?

regarding the UI, I think the ideal thing would be to just use the current running estimated size without the total (so the way it was before this change) plus a spinner, as discussed in the other PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla agreed on both! (spinner itself can wait until we have something from tqdm for example, or can do minor expriments - align stats to the left or whatnot).

thanks for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remote: short-circuit remote size estimation for large remotes
4 participants