remote: short-circuit remote size estimation for large remotes #3537

pmrowla · 2020-03-26T07:45:34Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Will close #3530

pmrowla · 2020-03-26T07:48:43Z

Blocked by #3532
This PR will need a rebase after #3532, as they both touch the same code (but I think it makes sense to keep the UI and size estimation logic changes in separate PRs).

The estimation behavior changes also won't be obvious without the size estimation pbar from the UI PR

pmrowla · 2020-03-30T06:11:52Z

The short-circuit point (which depends on the number of local checksums) is now used as the upper bound for the Estimating size of ... progress bar. In this example, we would stop estimating at a size of 10M files:

dvc/remote/base.py

efiop

🚀

shcheklein · 2020-03-30T16:37:37Z

dvc/remote/base.py

+        prefix = "0" * self.TRAVERSE_PREFIX_LEN
+        total_prefixes = pow(16, self.TRAVERSE_PREFIX_LEN)
+        if short_circuit:
+            max_remote_size = self._max_estimation_size(checksums)


minor: name is a bit confusing - probably should be max_threshold or something ... it's not remote size, right?
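For readers following the diff above, here is a minimal sketch of the short-circuit idea, not the code from this PR. It assumes the remote is sampled through a single hex prefix and the count is scaled by the number of possible prefixes, as the `prefix` / `total_prefixes` lines suggest. The names `estimate_remote_size`, `list_prefix`, `max_threshold`, and the prefix length of 3 are hypothetical.

```python
def estimate_remote_size(list_prefix, max_threshold=None, prefix_len=3):
    """Estimate the object count of a remote from a single hex prefix.

    list_prefix is a hypothetical callable that lazily yields the names
    stored under a prefix. If max_threshold is given, listing stops as
    soon as the running estimate exceeds it.
    """
    prefix = "0" * prefix_len
    total_prefixes = 16 ** prefix_len  # number of possible hex prefixes
    count = 0
    for _ in list_prefix(prefix):
        count += 1
        # Short-circuit: the remote is already known to be "large enough".
        if max_threshold is not None and count * total_prefixes > max_threshold:
            break
    return count * total_prefixes


# Usage with a fake remote that records how many names were actually listed.
listed = []


def fake_listing(prefix):
    for i in range(1000):
        name = f"{prefix}{i:029x}"
        listed.append(name)
        yield name


estimate = estimate_remote_size(fake_listing, max_threshold=10_000)
```

With a threshold in place, only a handful of names are pulled from the fake remote before the loop stops, which is the point of short-circuiting on very large remotes.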

pared

Post-merge LGTM

shcheklein · 2020-03-30T16:46:04Z

dvc/remote/base.py

+        else:
+            max_remote_size = None
+
+        with Tqdm(


My 2cs:

let's please check that GDrive (and other remotes?) is indeed lazy, i.e. that it does not prefetch all pages for the prefix

the short-circuiting itself is great, but I would probably keep the UI as it was before ... it has become even more confusing when I see 1% - 2% -> done ... imagine it counting up to 20%: users would be scared to see that, since they would expect to wait until 100%

code-wise, not for this PR: the size of this file has already gotten out of hand .... not sure what the best practice is in Python (is there one?), how do people keep large files in good shape? or split them into helpers, mixins, etc.? cc @efiop @Suor

pmrowla · 2020-03-31

@shcheklein It's lazy for gdrive. The only remote I wasn't sure about is hdfs. For hdfs we yield each item returned by hdfs.ls(<prefix>), but I don't think there's much else we can do in that case?

regarding the UI, I think the ideal thing would be to just use the current running estimated size without the total (so the way it was before this change) plus a spinner, as discussed in the other PR

shcheklein · 2020-03-31

@pmrowla agreed on both! (the spinner itself can wait until we have something from tqdm, for example, or we can do minor experiments: align stats to the left or whatnot).

thanks for looking into this!
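On the laziness question raised in this thread, here is a minimal sketch of what "lazy" means for a paginated listing: pages are fetched only as the consumer asks for more items, so stopping early avoids the remaining requests. The `fetch_page` callable and its token scheme are assumptions for illustration, not the API of any DVC remote.

```python
from itertools import islice


def list_lazily(fetch_page):
    """Yield items page by page, requesting the next page only on demand.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns (items, next_token), where next_token is
    None on the last page.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:
            return


# Fake backend with three pages that records which pages were requested.
PAGES = [["a", "b"], ["c", "d"], ["e", "f"]]
fetched = []


def fetch_page(token):
    index = 0 if token is None else token
    fetched.append(index)
    next_token = index + 1 if index + 1 < len(PAGES) else None
    return PAGES[index], next_token


# Taking only three items touches the first two pages and never the third.
first_three = list(islice(list_lazily(fetch_page), 3))
```

An eager listing would instead collect every page before returning anything, which defeats a short-circuit in the caller.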

pmrowla self-assigned this Mar 26, 2020

pmrowla force-pushed the 3530 branch 2 times, most recently from dba4c53 to 7cefeea on March 30, 2020 05:52

remote: short-circuit remote size estimation for large remotes

caa5595

pmrowla force-pushed the 3530 branch from 7cefeea to caa5595 on March 30, 2020 06:15

pmrowla marked this pull request as ready for review March 30, 2020 06:15

pmrowla changed the title ~~[WIP] remote: short-circuit remote size estimation for large remotes~~ remote: short-circuit remote size estimation for large remotes Mar 30, 2020

pmrowla requested review from pared, efiop and skshetry March 30, 2020 06:16

efiop reviewed Mar 30, 2020

efiop approved these changes Mar 30, 2020

efiop merged commit d276ba4 into iterative:master Mar 30, 2020

shcheklein reviewed Mar 30, 2020

pared reviewed Mar 30, 2020

shcheklein reviewed Mar 30, 2020

pmrowla deleted the 3530 branch March 31, 2020 04:01

pmrowla mentioned this pull request Mar 31, 2020

remote: use progress bar for remote cache query status during dvc gc #3559

Merged

3 tasks
