Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: use string paths over PathInfo for performance reasons #3672

Merged
merged 9 commits into from
Apr 28, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Apr 24, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Should close #3635.

  • LocalRemote now uses string paths instead of PathInfo in places where we expect to be dealing with a large number of paths (status and LocalRemote.cache_exists)
  • _process/upload/download pipeline has been left alone for now - in the event that we are pushing or pulling enough files that using PathInfo's becomes noticeably slow, the expectation is already going to be that the operation will be slow, given that we will be uploading/downloading a lot of data.

@pmrowla pmrowla changed the title remote: use string paths over PathInfo for performance reasons [WIP] remote: use string paths over PathInfo for performance reasons Apr 24, 2020
@pmrowla pmrowla force-pushed the 3635-remote-local-pathinfo branch from 9cb7a0f to 654b606 Compare April 24, 2020 02:34
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 24, 2020

_cloud_status for 2M files is down to 159s from 938s in #3634

Screen Shot 2020-04-24 at 11 25 37 AM

@pmrowla pmrowla force-pushed the 3635-remote-local-pathinfo branch from 75cffe8 to 9d4a3b8 Compare April 24, 2020 04:37
@pmrowla pmrowla changed the title [WIP] remote: use string paths over PathInfo for performance reasons remote: use string paths over PathInfo for performance reasons Apr 24, 2020
@pmrowla pmrowla requested a review from efiop April 24, 2020 05:49
@pmrowla pmrowla self-assigned this Apr 24, 2020
@pmrowla pmrowla added the performance improvement over resource / time consuming tasks label Apr 24, 2020
@pmrowla pmrowla changed the title remote: use string paths over PathInfo for performance reasons [WIP] remote: use string paths over PathInfo for performance reasons Apr 24, 2020
@pmrowla pmrowla force-pushed the 3635-remote-local-pathinfo branch from 0f69c29 to c4863b1 Compare April 27, 2020 04:10
pmrowla added 3 commits April 27, 2020 15:11
- cloud remotes still default to using PathInfo's
- if path is not relpath from cwd or abspath, posix lstat() syscall
  runtime doubles (from calculating relpath from cwd)
@pmrowla pmrowla force-pushed the 3635-remote-local-pathinfo branch from b17446f to 33ec6d7 Compare April 27, 2020 06:12
@@ -67,6 +67,13 @@ def cache_dir(self, value):
def supported(cls, config):
return True

@cached_property
def cache_path(self):
return os.path.abspath(self.cache_dir)
Copy link
Contributor Author

@pmrowla pmrowla Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If path is not abspath or relpath from current working dir, posix lstat() syscall runtime doubles. PathInfo.__str__ uses relpath from cwd, but since cwd may change during runtime (like in some of our erepo tests), we can't cache relpath(self.cache_dir) and we don't want to compute it each time, so we just use abspath.

@pmrowla pmrowla changed the title [WIP] remote: use string paths over PathInfo for performance reasons remote: use string paths over PathInfo for performance reasons Apr 27, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented Apr 27, 2020

@efiop solution is more generalized now. remote.checksum_to_path has been re-added to return string path on remotes which can benefit from it performance-wise (currently only local). On other remotes it just points to checksum_to_path_info

@efiop efiop merged commit 6d8499e into iterative:master Apr 28, 2020
@pmrowla pmrowla deleted the 3635-remote-local-pathinfo branch April 28, 2020 12:22
skshetry added a commit to skshetry/dvc that referenced this pull request Jun 14, 2020
PR iterative#3672 (6d8499e) extended `LocalRemote::_get_plans` to return one
`checksums` too. As all of the args from `_get_plans` was passed
down to `download()`, it recognized extra arg of checksum as
`no_progress_bar` due to which it became True and stopped showing
progress bar at all.

Fix iterative#3874
efiop pushed a commit that referenced this pull request Jun 14, 2020
PR #3672 (6d8499e) extended `LocalRemote::_get_plans` to return one
`checksums` too. As all of the args from `_get_plans` was passed
down to `download()`, it recognized extra arg of checksum as
`no_progress_bar` due to which it became True and stopped showing
progress bar at all.

Fix #3874
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remote: loading dirs is exteremely slow because of PathInfo
2 participants