get: load performance is significantly (many times) lower than for pull (reproduced on dvc-bench dataset) #6019

Closed
zimka opened this issue May 16, 2021 · 6 comments · Fixed by #6401

@zimka

zimka commented May 16, 2021

Bug Report

The get command for an external repo runs many times slower than clone + pull for the same repo (checked with a local remote). The same issue affects the import command.
For a reproducible toy dataset (1 GB, 10^4 files) there is a 5x difference, measured with cProfile (archive attached).
For my real data (which raised the issue in the first place), with 20 GB and 150k files, the slowdown is huge: pull takes 30 min, while the estimate for get was 350+ hours!

Reproduce

I have built a test setup that creates a toy dataset of random binary files and measures the difference on it, but I think the issue can be reproduced from scratch on any data.

  1. Create a dataset with enough data and files (1 GB and 10^4 files in my case)
  2. Set up a local remote, push the data there, then commit and push your repo to GitHub
  3. Clone your repo and use pull to load the data (measure)
  4. Use get to load the data (measure)
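For a quick wall-clock comparison of steps 3 and 4, a minimal timing helper could look like the sketch below. The helper name `time_command` is made up for illustration, and the command it runs here is only a placeholder so the snippet works anywhere; the real commands to substitute would be the pull inside the clone and the get against the repo URL.

```python
import subprocess
import sys
import time


def time_command(cmd, cwd=None):
    """Run cmd (a list of arguments) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


# Placeholder command (assumption): replace it with the real ones, for example
# ["dvc", "pull"] with cwd set to the clone, or ["dvc", "get", repo_url, path].
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.2f}s")
```

Wall-clock time only shows how large the gap is; it says nothing about where the time goes, which is what the profiles are for.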

I measured both ways with cProfile and get is 5x slower than pull.
cProfiles2.zip
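A sketch of how a cProfile result can be summarized with the standard pstats module, sorted by cumulative time. The `busy_work` function is a stand-in workload and `top_cumulative` is a hypothetical helper; neither comes from the attached archive. The same `pstats.Stats` call also accepts the path of a saved profile dump instead of a profiler object.

```python
import cProfile
import io
import pstats


def top_cumulative(stats_source, limit=10):
    """Return a text report of the `limit` entries with the highest cumulative time.

    stats_source may be a cProfile.Profile object or the path of a dump file.
    """
    buf = io.StringIO()
    stats = pstats.Stats(stats_source, stream=buf)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


def busy_work():
    # Stand-in workload (assumption), not related to dvc.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

report = top_cumulative(profiler)
print(report)
```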

The problem is that the ratio is probably not constant but depends on the number of files or the amount of data. For my real data, with 160k files and 20 GB, get had been running for 8 hours (which is by itself over 10x longer than the pull) and the ETA was still 350+ hours, while pull took only 30 min.
In my view, even if get is expected to be slow, a slowdown this large is unbearable.
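As a rough back-of-the-envelope check of the figures above, the arithmetic can be written out as follows. It assumes the 160k file count, treats the 350-hour ETA as a lower bound for the total get time, and ignores file size, so the rates are only indicative.

```python
files = 160_000             # file count quoted for the real dataset
pull_seconds = 30 * 60      # pull: 30 minutes
get_seconds = 350 * 3600    # get: 350 hours, taken as a lower bound of the ETA

slowdown = get_seconds / pull_seconds   # how many times slower get is
pull_rate = files / pull_seconds        # files per second for pull
get_rate = files / get_seconds          # files per second for get

print(f"slowdown: {slowdown:.0f}x")
print(f"pull: {pull_rate:.1f} files/s, get: {get_rate:.3f} files/s")
```

Under these assumptions get is about 700 times slower than pull, far beyond the 5x measured on the toy dataset, which is consistent with the ratio growing with dataset size.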

I have also noticed that the CPU is used intensively during pull but not during get. I tried increasing the number of jobs for get several times over, but it did not affect the ETA at all.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.1.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Windows-10-10.0.19041-SP0
Supports: http, https
Cache types: hardlink
Cache directory: NTFS on C:\
Caches: local
Remotes: local
Workspace directory: NTFS on C:\
Repo: dvc, git
@zimka zimka changed the title get: load performance is significantly (many times) lower than for pull with local rempte get: load performance is significantly (many times) lower than for pull with local remote May 16, 2021
@shcheklein shcheklein added the performance improvement over resource / time consuming tasks label May 16, 2021
@skshetry
Member

Most likely this is happening because of #5546. The cProfile log is not very helpful for pinpointing the cause.

@dberenbaum
Collaborator

Should get be included in dvc-bench?

@zimka
Author

zimka commented May 18, 2021

@dberenbaum it looks like it is included (afaik import and get use similar code), and I have found some graphs.
However, on my PC the execution time is drastically different from the bench graphs:

  1. git clone + dvc pull data/cats_dogs.dvc: 5-10 mins (33 mins on the bench graph)
  2. dvc get https://github.com/iterative/dvc-bench data/cats_dogs.dvc: more than 60 mins (5 mins on the bench graph)

I'll check later whether the results are the same on Linux.

@zimka
Author

zimka commented May 19, 2021

I got the same results on Linux with dvc 2.0.17 and https://github.com/iterative/dvc-bench data/cats_dogs.dvc, which means the problem is not specific to Windows or to local storage, and can be reproduced on the existing dataset in dvc-bench.
Also, in my case dvc get used 4 CPUs according to htop and the average load was similar to dvc pull, but File/s was much lower.
I'm not sure why my results differ from the dvc-bench graphs.

@zimka zimka changed the title get: load performance is significantly (many times) lower than for pull with local remote get: load performance is significantly (many times) lower than for pull (reproduced on dvc-bench dataset) May 21, 2021
@efiop
Contributor

efiop commented Jun 16, 2021

For the record, the prerequisites are #6109 and the subsequent save()/transfer() unification. We'll double-check this issue after those changes are merged, though it will likely be fixed automatically. The ETA for those so far is ~Jul 1st.

@pmrowla
Contributor

pmrowla commented Jul 13, 2021

As noted in #6109, dvc import performance is roughly equivalent to dvc pull in master now, but dvc get is still very slow. get is slow because we directly call repo_fs.download(path), which just calls fs.open() and streams/copies every file in a directory (whether the file ends up coming from git or from a DVC out).

repo_fs.download will need to be updated to use the same object collection and save/transfer optimizations as import, so that we don't do an individual stream/copy for everything inside DVC outs.
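To illustrate the general difference between the two strategies, here is a toy sketch on plain local directories. It is not DVC code and does not reproduce the repo_fs or object-transfer internals; `copy_one_by_one` and `copy_batched` are hypothetical names. The first function opens and streams each file serially, the second collects the full file list up front and hands it to a worker pool.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_one_by_one(src: Path, dst: Path) -> int:
    """Open and stream every file serially, one at a time."""
    count = 0
    for path in sorted(src.rglob("*")):
        if path.is_file():
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            with path.open("rb") as fin, target.open("wb") as fout:
                shutil.copyfileobj(fin, fout)
            count += 1
    return count


def copy_batched(src: Path, dst: Path, jobs: int = 4) -> int:
    """Collect the whole file list first, then transfer it with a worker pool."""
    files = [p for p in src.rglob("*") if p.is_file()]
    # Create target directories up front so workers only copy data.
    for p in files:
        (dst / p.relative_to(src)).parent.mkdir(parents=True, exist_ok=True)

    def transfer(p: Path) -> None:
        shutil.copyfile(p, dst / p.relative_to(src))

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(transfer, files))  # list() surfaces worker exceptions
    return len(files)


# Tiny demo on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "src"
    (src / "sub").mkdir(parents=True)
    (src / "a.bin").write_bytes(b"a" * 1024)
    (src / "sub" / "b.bin").write_bytes(b"b" * 1024)
    print(copy_one_by_one(src, root / "serial"), copy_batched(src, root / "batched"))
```

On a handful of local files the two behave the same; the gap only matters when per-file overhead dominates, as with the 10^4 to 160k files reported in this issue.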
