Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: Optimize traverse/no_traverse behavior #3501

Merged
merged 11 commits into from
Mar 25, 2020
Merged

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Mar 18, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • Estimate remote file count by fetching a single parent cache dir, then
    determine whether or not to use no_traverse method for checking
    remainder of cache entries in cache_exists
  • Thread requests when traversing full remote file lists (by fetching
    one parent cache dir per thread)

Will close #3488.

TODO:

  • Azure
  • S3
  • GDrive
  • Google CS
  • HDFS
  • OSS
  • Tests
  • Docs

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 18, 2020

Currently only tested on S3

dvc push runtime comparisons for S3 remote with 1M cache total entries, and 100k local files where /usr/local/bin/dvc is 0.88.0 via homebrew, and dvc is from git:

0.88.0 no_traverse (default behavior)

remote-100k-files git:master ✢ ✩  py:dvc ❯ time /usr/local/bin/dvc push
 Everything is up to date.
/usr/local/bin/dvc push  421.31s user 25.58s system 37% cpu 19:58.64 total

git (master w/forced traverse)

remote-100k-files git:master ✢ ✩  py:dvc ❯ time dvc push
Everything is up to date.
dvc push  205.87s user 7.83s system 29% cpu 11:53.41 total

git (3488 branch w/new behavior)

remote-100k-files git:master ✢ ✩  py:dvc ❯ time dvc push
Everything is up to date.
dvc push  225.73s user 14.12s system 98% cpu 4:02.43 total

When testing with 1k and 10k local files, no_traverse is always faster for S3 on my machine

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 19, 2020

Should be able to start running gdrive benchmarks tomorrow, currently waiting on a push to populate a test remote to finish overnight

@pmrowla pmrowla self-assigned this Mar 19, 2020
- estimate remote file count by fetching a single parent cache dir, then
  determine whether or not to use no_traverse method for checking
  remainder of cache entries in `cache_exists`
- thread requests when traversing full remote file lists (by fetching
  one parent cache dir per thread)
@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 20, 2020

Benchmarks

0.88.0 from homebrew vs git:3488 branch
All times measured with

time dvc push

for an up-to-date repo (so no data needs to be pushed, and cache_exists is the main runtime factor)

S3

dvc no_traverse local files remote files runtime
0.88.0 true 1k 1k 10.07s user 1.47s system 61% cpu 18.733 total
0.88.0 false 1k 1k 1.79s user 0.67s system 34% cpu 7.096 total
git n/a 1k 1k 1.73s user 0.57s system 44% cpu 5.181 total
0.88.0 true 10k 100k 44.44s user 3.28s system 39% cpu 2:01.46 total
0.88.0 false 10k 100k 21.48s user 1.42s system 18% cpu 2:07.25 total
git n/a 10k 100k 24.64s user 1.71s system 91% cpu 28.717 total
0.88.0 true 100k 1M 421.31s user 25.58s system 37% cpu 19:58.64 total
git:master false 100k 1M 205.87s user 7.83s system 29% cpu 11:53.41 total
git n/a 100k 1M 225.73s user 14.12s system 98% cpu 4:02.43 total

GDrive

dvc no_traverse local files remote files runtime
0.88.0 false 10k 500k >30min (killed before finished running)
git n/a 10k 500k 78.13s user 4.94s system 57% cpu 2:23.25 total
git n/a 100k 500k 61.82s user 6.66s system 43% cpu 2:36.76 total

HDFS

Tested w/local single node hadoop instance inside virtualbox, dvc 0.87.0 from deb

dvc no_traverse local files remote files runtime
0.87.0 true 10k 100k real 0m18.388s user 0m20.496s sys 0m5.096s
git n/a 10k 100k real 0m17.542s user 0m20.316s sys 0m2.292s
0.87.0 true 100k 100k real 2m34.512s user 2m9.272s sys 0m32.828s
git n/a 100k 100k real 0m29.858s user 0m31.596 sys 0m3.096

- default to 3
- for remotes that only support per-directory query use 2 (gdrive, hdfs)
@pmrowla pmrowla marked this pull request as ready for review March 23, 2020 05:32
@pmrowla pmrowla changed the title [WIP] remote: Optimize traverse/no_traverse behavior remote: Optimize traverse/no_traverse behavior Mar 23, 2020
@pmrowla pmrowla requested a review from efiop March 23, 2020 05:33
@pared pared self-requested a review March 23, 2020 14:00
dvc/remote/base.py Show resolved Hide resolved
dvc/remote/azure.py Outdated Show resolved Hide resolved
dvc/remote/azure.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/hdfs.py Outdated Show resolved Hide resolved
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Please see a few minor questions/comments.

dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really great! Just a few comments/suggestions. Also CI fails for some reason?

@shcheklein
Copy link
Member

suggestion, btw - does it make sense to short-cut on a certain number of checksums? like it's clear that if it's one we can always to a singe exist request?

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 24, 2020

suggestion, btw - does it make sense to short-cut on a certain number of checksums? like it's clear that if it's one we can always to a singe exist request?

I think it makes sense to short-cut for small numbers of checksums, but I'm not sure where to set the cutoff since I don't have a great idea of what the typical remote use case looks like (which affects how many requests the autodetection/size estimation takes)

dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Mar 24, 2020

@pmrowla Looks great! The only question to figure out is #3501 (comment) , the rest of my above comments are just tiny nitpicks.

To be honest, extremely eager to merge this and try it out with imagenet πŸ˜‰

- remove RemoteBASE.DEFAULT_NO_TRAVERSE and replace it with
  RemoteBASE.CAN_TRAVERSE (True for all remotes except http)
Copy link
Contributor Author

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing SSHMocked test occurred because the old test explicitly configured the no_traverse option to be false. After obsoleting no_traverse, the test needed to be modified to set RemoteSSH.CAN_TRAVERSE to false to get the same behavior

@efiop efiop merged commit 9b38aa4 into iterative:master Mar 25, 2020
@efiop
Copy link
Contributor

efiop commented Mar 25, 2020

Thank you @pmrowla ! πŸ™ SSH is a tricky beast here, because it is way too sensitive to the number of connections that it can handle. Let's see how it goes in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remote: improve traverse/no_traverse behavior
5 participants