get: load performance is significantly (many times) lower than for pull (reproduced on dvc-bench dataset) #6019
Comments
Most likely this is happening because of #5546.
Should get be included in dvc-bench?
@dberenbaum it looks like it is included (afaik
I got the same results on Linux with dvc 2.0.17, using data/cats_dogs.dvc from https://github.com/iterative/dvc-bench, which means that the problem is not specific to Windows or to local storage and can be reproduced on a dataset that already exists in dvc-bench.
For the record: pre-requisites: #6109 and the following
As noted in #6109,
Bug Report
The get command for an external repo runs many times slower than clone + pull for the same repo (checked on a local remote). The same issue occurs with the import command.
For a reproducible toy dataset (1Gb, 10^4 files) there is a 5x difference, measured by cProfile (archive attached).
For my real data (which raised the issue in the first place), with 20Gb and 150k files, the slowdown is huge: pull takes 30min, while the estimate for get was 350+ hours!
Reproduce
I have created a test setup that generates a toy dataset of random binary files and measures the difference on it, but I think the issue can be reproduced from scratch on any data.
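A toy dataset like the one described could be generated along these lines. This is only a sketch, not the actual test setup from the report: the function name make_toy_dataset, the directory layout, and the default sizes (10^4 files of 100 KiB each, roughly 1 GB in total) are assumptions chosen to match the numbers quoted in this issue.

```python
import os
from pathlib import Path


def make_toy_dataset(root, n_files=10_000, file_size=100 * 1024, files_per_dir=1_000):
    """Write n_files files of random bytes under root; return the total size in bytes.

    Hypothetical helper: names, layout, and defaults are illustrative assumptions.
    """
    root = Path(root)
    total = 0
    for i in range(n_files):
        # Spread files over subdirectories so that no single directory gets huge.
        subdir = root / f"dir_{i // files_per_dir:04d}"
        subdir.mkdir(parents=True, exist_ok=True)
        # Random content, so files are very unlikely to share a hash.
        data = os.urandom(file_size)
        (subdir / f"file_{i:06d}.bin").write_bytes(data)
        total += len(data)
    return total
```

Calling make_toy_dataset("data") would fill a data directory that can then be tracked with dvc add and pushed to the remote.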
1. Set up a local remote, push data there, then commit and push your repo to GitHub
2. pull to load data (measure)
3. get to load data (measure)
I measured both ways with cProfile, and get is 5x slower than pull.
cProfiles2.zip
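The measurement steps above can be sketched as follows. This is a hedged illustration rather than the script used for the attached profiles: time_command, slowdown, and top_functions are hypothetical helpers, the repo URL and paths in the usage comments are placeholders, and plain wall-clock timing stands in for a full cProfile run. top_functions reads a saved cProfile dump through the standard pstats module.

```python
import pstats
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd (a list of arguments) and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def slowdown(pull_seconds, get_seconds):
    """Return how many times slower get is than pull."""
    return get_seconds / pull_seconds


def top_functions(profile_path, limit=10):
    """Return (function name, cumulative seconds) for the costliest entries in a cProfile dump."""
    stats = pstats.Stats(profile_path)
    rows = [
        (func[2], timings[3])  # func = (file, line, name); timings[3] = cumulative time
        for func, timings in stats.stats.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Hypothetical usage; the clone path, repo URL, and data path are placeholders:
# pull_s = time_command(["dvc", "pull"], cwd="path/to/clone")
# get_s = time_command(["dvc", "get", "https://github.com/<user>/<repo>", "data"])
# print(slowdown(pull_s, get_s))
```

With the real-data figures reported in this issue (30 minutes for pull, a 350+ hour estimate for get), slowdown(30 * 60, 350 * 3600) gives 700, which is far beyond the 5x seen on the toy dataset.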
The problem is that the ratio is probably not constant but depends on the number of files or the amount of data. For my real data, with 160k files and 20Gb, get had been running for 8 hours (already more than 10x longer than pull) and the ETA was 350+ hours, while pull took only 30min.
In my view, even if get is expected to be slow, a slowdown this large is not bearable.
I have also noticed that the CPU is used intensively during pull but not during get. I tried increasing the number of jobs for get several times over, but it did not affect the ETA at all.
Environment information
Output of dvc doctor: