get: load performance is significantly (many times) lower than for pull (reproduced on dvc-bench dataset) #6019

zimka · 2021-05-16T16:40:00Z

Bug Report

The get command for an external repo runs many times slower than clone + pull for the same repo (checked with a local remote). The same issue occurs with the import command.
For a reproducible toy dataset (1 GB, 10^4 files) there is a 5x difference, measured with cProfile (archive attached).
For my real data (which raised the issue in the first place), with 20 GB and 150k files, the slowdown is huge: pull takes 30 min, while the ETA for get was 350+ hours!

Reproduce

I have created a test setup that generates a toy dataset of random binary files and measures the difference on it, but I think the issue can be reproduced from scratch on any data.

Create a dataset; it should contain enough data and files (1 GB and 10^4 files in my case)
Set up a local remote, push the data there, then commit and push your repo to GitHub
Clone your repo and use pull to load the data (measure)
Use get to load the data (measure)
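
For the first step, a toy dataset of random binary files could be generated with something like the sketch below. This is not the script from the attached archive; the function name, directory layout, and default sizes are assumptions chosen to match the figures quoted above (10^4 files of 100 kB each, about 1 GB in total).

```python
import os
from pathlib import Path


def make_toy_dataset(root, n_files=10_000, file_size=100_000, n_dirs=100):
    """Write n_files files of random bytes under root.

    Files are spread over n_dirs subdirectories so that no single
    directory holds every file. All names and sizes are illustrative.
    """
    root = Path(root)
    for i in range(n_files):
        subdir = root / f"{i % n_dirs:03d}"
        subdir.mkdir(parents=True, exist_ok=True)
        # os.urandom yields incompressible data, so every file is unique.
        (subdir / f"{i:06d}.bin").write_bytes(os.urandom(file_size))
    return root
```

Calling `make_toy_dataset("data/toy")` with the defaults writes roughly 1 GB; pass smaller values for a quick trial run.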

I measured both ways with cProfile and get is 5x slower than pull.
cProfiles2.zip

The problem is that the ratio is probably not constant but depends on the number of files or the amount of data. For my real data with 160k files and 20 GB, after 8 hours of execution (already 10x longer than the pull) the ETA for get was 350+ hours, while pull took only 30 min.
In my view, even if get is expected to be slow, a slowdown this large is unacceptable.

I have also noticed that the CPU is used intensively during pull but not during get. I tried increasing the number of jobs for get several times over, but it did not affect the ETA at all.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.1.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Windows-10-10.0.19041-SP0
Supports: http, https
Cache types: hardlink
Cache directory: NTFS on C:\
Caches: local
Remotes: local
Workspace directory: NTFS on C:\
Repo: dvc, git

skshetry · 2021-05-18T15:11:10Z

Most likely this is happening because of #5546. The cProfile log is not very helpful in pointing out the cause.

dberenbaum · 2021-05-18T17:54:27Z

Should get be included in dvc-bench?

zimka · 2021-05-18T18:47:04Z

@dberenbaum it looks like it is included (afaik import and get use similar code), and I have found some graphs.
However, on my PC the execution time is drastically different from the bench graphs:

git clone + dvc pull data/cats_dogs.dvc: 5-10 mins (33 mins on the bench graph)
dvc get https://github.com/iterative/dvc-bench data/cats_dogs.dvc: more than 60 mins (5 mins on the bench graph)

I'll check later if results are the same on Linux.
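
To compare the two commands above outside of dvc-bench, a small wall-clock timing helper is enough. The sketch below is only an illustration: `time_command` is a hypothetical helper, it assumes `dvc` is on the PATH, and it measures elapsed time rather than producing a cProfile report.

```python
import subprocess
import time


def time_command(argv, cwd=None):
    """Run argv to completion and return the elapsed wall-clock seconds.

    Raises subprocess.CalledProcessError if the command exits non-zero,
    so a failed download is never mistaken for a fast one.
    """
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True)
    return time.perf_counter() - start
```

For example, `time_command(["dvc", "get", "https://github.com/iterative/dvc-bench", "data/cats_dogs.dvc"])` times the get, and `time_command(["dvc", "pull", "data/cats_dogs.dvc"], cwd="dvc-bench")` times the pull inside an existing clone.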

zimka · 2021-05-19T07:27:18Z

I got the same results on Linux with dvc 2.0.17 using https://github.com/iterative/dvc-bench data/cats_dogs.dvc, which means that the problem is not specific to Windows or to local storage, and can be reproduced on the existing dataset in dvc-bench.
Also, in my case dvc get used 4 CPUs according to htop and the average load was similar to dvc pull, but the File/s rate was much lower.
Not sure why my results differ from the dvc-bench graphs.

efiop · 2021-06-16T18:43:16Z

For the record, the prerequisites are #6109 and the following save()/transfer() unification. We'll double-check this issue after those changes are merged; it will likely be fixed automatically, though. The ETA for those so far is ~Jul 1st.

pmrowla · 2021-07-13T03:33:04Z

As noted in #6109, dvc import performance is roughly equivalent to dvc pull in master now, but dvc get is still very slow. The get slowdown comes from directly calling repo_fs.download(path), which just calls fs.open() and streams/copies every file in a directory (whether the file ends up coming from git or from a DVC out).

repo_fs.download will need to be updated to use the same object collection and save/transfer optimizations as import, so that we don't do an individual stream/copy for everything inside DVC outs.
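
To make the difference concrete, here is a generic sketch contrasting a one-file-at-a-time stream/copy loop with a batched copy over a thread pool. It is not DVC code and does not reproduce the object collection or save/transfer logic mentioned above; every function name here is hypothetical, and plain local files stand in for whatever the real filesystem objects are.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_one(src, dst):
    """Open src and stream its bytes into dst, creating parent dirs."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)


def list_pairs(src_root, dst_root):
    """Collect (source, destination) pairs for every file under src_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    return [
        (p, dst_root / p.relative_to(src_root))
        for p in src_root.rglob("*")
        if p.is_file()
    ]


def copy_sequential(src_root, dst_root):
    """Copy files one after another, as in a plain open-and-stream loop."""
    pairs = list_pairs(src_root, dst_root)
    for src, dst in pairs:
        copy_one(src, dst)
    return len(pairs)


def copy_parallel(src_root, dst_root, jobs=4):
    """Collect all files first, then copy them with `jobs` worker threads."""
    pairs = list_pairs(src_root, dst_root)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for all copies and re-raises any worker exception.
        list(pool.map(lambda pair: copy_one(*pair), pairs))
    return len(pairs)
```

With many small files, per-file latency dominates the sequential loop, which is where collecting the work up front and running it in parallel tends to help; how much it helps depends on the storage backend.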

zimka changed the title ~~get: load performance is significantly (many times) lower than for pull with local rempte~~ get: load performance is significantly (many times) lower than for pull with local remote May 16, 2021

shcheklein added the performance improvement over resource / time consuming tasks label May 16, 2021

zimka changed the title ~~get: load performance is significantly (many times) lower than for pull with local remote~~ get: load performance is significantly (many times) lower than for pull (reproduced on dvc-bench dataset) May 21, 2021

efiop mentioned this issue Jun 16, 2021

DVC import file hashing only runs on one CPU thread #5546

Closed

efiop mentioned this issue Aug 9, 2021

repofs: use underlying fs.download to download files #6401

Merged

2 tasks

efiop self-assigned this Aug 9, 2021

efiop closed this as completed in #6401 Aug 10, 2021
