-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remote: short-circuit remote size estimation for large remotes #3537
Conversation
Blocked by #3532 The estimation behavior changes also won't be obvious without the size estimation pbar from the UI PR |
dba4c53
to
7cefeea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π
prefix = "0" * self.TRAVERSE_PREFIX_LEN | ||
total_prefixes = pow(16, self.TRAVERSE_PREFIX_LEN) | ||
if short_circuit: | ||
max_remote_size = self._max_estimation_size(checksums) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: name is a bit confusing - probably should be max_threshold
or something ... it's not remote size, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Post-merge LGTM
else: | ||
max_remote_size = None | ||
|
||
with Tqdm( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2cs:
- let's check please GDrive (other remotes?) that it is indeed lazy (does not prefetch all pages for the prefix)
- short circuitry itself is great, but I would probably keep UI as it was before ... it became even more confusing - when I see 1% - 2% -> done ... let's imagine we will have it counting to 20% and I users would be scared to see it since they would expect to wait until 100%?
- code wise, not this PR - size of this file got out of hands already .... not sure what is the best practice in Python (are there?), how people support large files in a good shape? or split into helpers, mixins, etc? cc @efiop @Suor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein It's lazy for gdrive. The only remote I wasn't sure about is hdfs. For hdfs we yield
each item returned by hdfs.ls(<prefix>)
, but I don't think there's much else we can do in that case?
regarding the UI, I think the ideal thing would be to just use the current running estimated size without the total (so the way it was before this change) plus a spinner, as discussed in the other PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla agreed on both! (spinner itself can wait until we have something from tqdm for example, or can do minor expriments - align stats to the left or whatnot).
thanks for looking into this!
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
β I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. π
Will close #3530