remote: Optimize traverse/no_traverse behavior #3501
Currently only tested on S3
git (master w/forced traverse)
git (3488 branch w/new behavior)
When testing with 1k and 10k local files,
Should be able to start running GDrive benchmarks tomorrow; currently waiting on an overnight push to finish populating a test remote.
- estimate remote file count by fetching a single parent cache dir, then determine whether or not to use no_traverse method for checking remainder of cache entries in `cache_exists`
- thread requests when traversing full remote file lists (by fetching one parent cache dir per thread)
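The two commit bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `list_prefix`, `exists`, `estimate_remote_size`, and `traverse` are hypothetical names. The 256 two-hex-digit parent dirs and the decision rule (traverse only when the number of checksums reaches the estimated remote size) are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: cache entries live under 256 two-hex-digit parent dirs.
PREFIXES = [f"{i:02x}" for i in range(256)]


def estimate_remote_size(list_prefix, sample="00"):
    """Extrapolate the remote file count from a single parent cache dir."""
    return len(list_prefix(sample)) * len(PREFIXES)


def traverse(list_prefix, jobs=4):
    """List the whole remote, one parent cache dir per worker task."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return {name for names in pool.map(list_prefix, PREFIXES) for name in names}


def cache_exists(checksums, list_prefix, exists, jobs=4):
    """Return the subset of `checksums` that is present on the remote."""
    checksums = set(checksums)
    # Hypothetical rule: traverse only when per-file queries would outnumber
    # the entries a full listing is expected to return.
    if len(checksums) >= estimate_remote_size(list_prefix):
        return checksums & traverse(list_prefix, jobs)
    # Otherwise use the no_traverse method: one existence query per checksum.
    ordered = sorted(checksums)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        flags = list(pool.map(exists, ordered))
    return {c for c, ok in zip(ordered, flags) if ok}
```

Here `list_prefix(prefix)` is expected to return a list of checksums stored under that parent dir, and `exists(checksum)` a boolean for a single entry; both would be backed by the remote's own API in practice.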
Benchmarks
for an up-to-date repo (so no data needs to be pushed, and S3
GDrive
HDFS
Tested w/local single node hadoop instance inside virtualbox, dvc
- default to 3
- for remotes that only support per-directory query use 2 (gdrive, hdfs)
Looks great! Please see a few minor questions/comments.
Looks really great! Just a few comments/suggestions. Also CI fails for some reason?
Suggestion, btw: does it make sense to short-cut on a certain number of checksums? E.g., it's clear that if there is only one, we can always do a single exists request.
I think it makes sense to short-cut for small numbers of checksums, but I'm not sure where to set the cutoff since I don't have a great idea of what the typical remote use case looks like (which affects how many requests the autodetection/size estimation takes).
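One way the short-cut discussed here could look, as a sketch only: `pick_strategy` and `SMALL_QUERY_CUTOFF` are hypothetical names, and the cutoff value of 1 is a placeholder, since the thread leaves the right value open. Passing the size estimate as a callable means the estimation request is skipped entirely for small batches.

```python
SMALL_QUERY_CUTOFF = 1  # hypothetical value; the discussion leaves the cutoff open


def pick_strategy(n_checksums, estimate_remote_size, cutoff=SMALL_QUERY_CUTOFF):
    """Choose "no_traverse" or "traverse" for a batch of checksums.

    `estimate_remote_size` is a zero-argument callable so that the
    estimation request is only issued when it is actually needed.
    """
    if n_checksums <= cutoff:
        # Small batch: go straight to per-file existence queries.
        return "no_traverse"
    remote_size = estimate_remote_size()
    # Placeholder comparison, assumed for illustration only.
    return "traverse" if n_checksums >= remote_size else "no_traverse"
```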
@pmrowla Looks great! The only question to figure out is #3501 (comment); the rest of my above comments are just tiny nitpicks. To be honest, extremely eager to merge this and try it out with imagenet.
- remove RemoteBASE.DEFAULT_NO_TRAVERSE and replace it with RemoteBASE.CAN_TRAVERSE (True for all remotes except http)
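A rough sketch of how a class-level `CAN_TRAVERSE` flag might gate the decision. The classes below are simplified stand-ins, not the real DVC classes; `RemoteHTTP` and `use_traverse` are hypothetical names, and the size comparison is a placeholder.

```python
class RemoteBASE:
    # Stand-in base class: True when the remote can list its contents.
    CAN_TRAVERSE = True

    def use_traverse(self, n_checksums, remote_size):
        """Return True when a full listing should be used."""
        if not self.CAN_TRAVERSE:
            # Remotes that cannot list files always query per checksum.
            return False
        return n_checksums >= remote_size  # placeholder rule (assumption)


class RemoteHTTP(RemoteBASE):
    # Hypothetical subclass name for the http remote, which cannot traverse.
    CAN_TRAVERSE = False
```

With a capability flag like this, whether to traverse is no longer a user-facing option; it follows from what the remote supports plus the size estimate.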
Failing SSHMocked test occurred because the old test explicitly configured the no_traverse option to be false. After obsoleting no_traverse, the test needed to be modified to set RemoteSSH.CAN_TRAVERSE to false to get the same behavior.
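A test could override the class attribute temporarily along these lines. This is a sketch with a stand-in `RemoteSSH` class and a hypothetical `strategy` method, not the actual test from this PR; only `unittest.mock.patch.object` is a real standard-library call.

```python
from unittest import mock


class RemoteSSH:
    # Stand-in for the real class; only the flag matters for this sketch.
    CAN_TRAVERSE = True

    def strategy(self):
        return "traverse" if self.CAN_TRAVERSE else "no_traverse"


def test_ssh_no_traverse():
    # Force the no_traverse code path for the duration of the test only.
    with mock.patch.object(RemoteSSH, "CAN_TRAVERSE", False):
        assert RemoteSSH().strategy() == "no_traverse"
    # The original value is restored once the context manager exits.
    assert RemoteSSH.CAN_TRAVERSE is True
```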
Thank you @pmrowla! SSH is a tricky beast here, because it is way too sensitive to the number of connections that it can handle. Let's see how it goes in the future.
- I have followed the Contributing to DVC checklist.
- If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
- I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible.
Will close #3488.
TODO: