status: it's slow #5594

Closed
jorgeorpinel opened this issue Mar 11, 2021 · 21 comments
Labels
A: status (related to the dvc diff/list/status commands), performance (improvement over resource / time consuming tasks)

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 11, 2021

Bug Report

Description

status seems slow.

Reproduce

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc pull
$ dvc status
Cloning

It lingers there for a few seconds before changing to "Data and pipelines are up to date".

Before 2.0 this was pretty much instantaneous. A few users have reported this on Discord, BTW (two in #q-and-a earlier today; one of them mentioned add being slow too). @efiop mentioned it could be related to the new Dulwich implementation.

Expected

Instantaneous report, especially for such a simple project as in the example above.

Environment information

Output of dvc doctor:

$ dvc version
DVC version: 2.0.5
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

@jorgeorpinel jorgeorpinel added the performance improvement over resource / time consuming tasks label Mar 11, 2021
@pmrowla
Contributor

pmrowla commented Mar 11, 2021

Based on the discord discussion it's probably because of #5544

We can revert the status check change for now; I'll look into whether using pygit2's status implementation is faster.
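For context, a rough sketch of what a dirty-check via pygit2 could look like (a hypothetical standalone snippet, not DVC's actual implementation; the repo path is a placeholder):

import pygit2

repo = pygit2.Repository(".")  # placeholder: path to any git repository
# status() maps each path that differs from HEAD/index to a bitmask of
# GIT_STATUS_* flags; an empty result means the worktree is clean
changed = {
    path: flags
    for path, flags in repo.status().items()
    if not flags & pygit2.GIT_STATUS_IGNORED
}
print("dirty" if changed else "clean")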

@pmrowla pmrowla self-assigned this Mar 11, 2021
@pmrowla pmrowla added the p0-critical Critical issue. Needs to be fixed ASAP. label Mar 11, 2021
@jorgeorpinel
Contributor Author

Is that done during Cloning? Does it affect add too? Thanks

@pmrowla
Contributor

pmrowla commented Mar 11, 2021

It's not related to cloning. It's pretty much done whenever we load stages, and yes, it's also what is affecting the add performance. Or at least, the issue that should be resolved by the linked PR is definitely the cause of the add and repro performance issues. There may be an unrelated status issue as well, though?

@pmrowla
Contributor

pmrowla commented Mar 11, 2021

Ok, so it looks like the status issue is actually separate. It's still a dulwich/git issue, but related to gitignore performance.

dvc status in the DVC git repo:

[screenshot: cProfile output of dvc status in the DVC git repo]

The add/repro performance issue should maybe go into its own GH issue, as it's related to the git status/git-track reminder (although gitignore performance probably affects those commands as well).
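For anyone who wants to poke at the gitignore part in isolation, here is a minimal standalone snippet using dulwich's ignore machinery (a hypothetical example, not DVC code; the repo and file paths are placeholders):

from dulwich.ignore import IgnoreFilterManager
from dulwich.repo import Repo

repo = Repo(".")  # placeholder: a repo with several .gitignore patterns
manager = IgnoreFilterManager.from_repo(repo)
# is_ignored() evaluates the .gitignore filters relevant to the path,
# which is the work that dominates the cProfile output above
print(manager.is_ignored("data/raw/file.txt"))  # placeholder path

Running this in a loop over many paths should reproduce the slowdown independently of DVC.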

@pmrowla pmrowla added p1-important Important, aka current backlog of things to do and removed p0-critical Critical issue. Needs to be fixed ASAP. labels Mar 11, 2021
@pmrowla pmrowla removed their assignment Mar 11, 2021
@skshetry
Member

@pmrowla, is that happening on an external repo? It should only be in effect for the local repository.

@skshetry
Member

For me, it's quite fast for example-get-started.
[screenshot: dvc status timing on example-get-started]

@pmrowla
Contributor

pmrowla commented Mar 11, 2021

@skshetry the gitignore cProfile output that I posted is from running dvc status inside the DVC git repo (which has several gitignore patterns to match).

In example-get-started (which has almost none) it is also fast for me.

Using the 1.11 branch, dvc status takes ~1/3 the time to run in the DVC git repo for me:

[screenshot: cProfile output of dvc status on the 1.11 branch]

@alealv

alealv commented Apr 20, 2021

👋

I'm working with a huge dataset (5 TB), and running the status or commit command is becoming a bit annoying. I would expect them to take a long time, but they take ages (I mean days to process), and before computing hashes they don't give any output whatsoever, so I have no way of knowing whether the process is dead or not.

So I decided to check if there is any way of improving the code and I found this thread.

Profiling

I've profiled the status command as done above but only for two different stages of my pipeline:

  1. The first one processes ~106 GB of data (~292,370 files). It uses the LibriSpeech ASR dataset.
    [profiling screenshots: profiling_librispeech_1, profiling_librispeech_2]

  2. The second one processes ~50 GB of data (~268,270 files). It uses the TED-LIUM dataset.
    [profiling screenshots: profiling_1, profiling_2]

So I noticed two things:

  1. Most of the time is spent in a deadlock (code section)
  2. The other big chunk is because of os.walk (code section)

Comparing times with Bash

I've done the comparison only for the TED-LIUM dataset (second process)

  • List files:
aalvarez@ml1:/mnt/data/dvc/tmp/data/tedlium_waves$ time ls . > /dev/null

real    0m2,656s
user    0m0,607s
sys     0m0,350s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f > /dev/null

real    0m26,133s
user    0m0,222s
sys     0m3,540s

So, compared to DVC's use of os.walk, this takes significantly less time.
DVC --> 538s / 60 = ~8.97 min

  • Computing hashes
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f -exec md5sum {} \; > /dev/null

real    12m4,800s
user    5m2,927s
sys     3m47,566s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f | parallel --progress md5sum > /dev/null

Computers / CPU cores / Max jobs to run
1:local / 36 / 36

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/268263/100%/0.0s

real    7m38,746s
user    10m26,780s
sys     9m43,082s

And hash computation is more or less the same depending on which number you take, but still slower than what can be accomplished:
DVC --> 660s / 60 = 11 min

Proposition

  1. True parallelism would come from using a multiprocessing.Pool instead of a ThreadPoolExecutor; I also believe it would be best to use imap_unordered (see the sketch after this list).
  2. Maybe this is going too deep, but looking at the os.walk code, it first appends everything to a list and then returns. Wouldn't it be better to use scandir() and yield?
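To illustrate both points, here is a minimal sketch of what I have in mind (a lazy scandir-based walk feeding multiprocessing.Pool.imap_unordered; the root path and chunksize are placeholders, and this is not DVC's actual code):

import hashlib
import os
from multiprocessing import Pool

def iter_files(root):
    # lazily yield file paths with os.scandir instead of building a full list
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path

def md5sum(path):
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

if __name__ == "__main__":
    root = "/mnt/data/dvc/tmp/data/tedlium_waves"  # placeholder path from the timings above
    with Pool() as pool:
        # imap_unordered yields each result as soon as a worker finishes it
        for path, md5 in pool.imap_unordered(md5sum, iter_files(root), chunksize=64):
            print(md5, path)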

@alealv

alealv commented Jul 21, 2021

Hi, are there any updates on this?

@efiop
Contributor

efiop commented Jul 21, 2021

@alealv Could you show dvc doctor too, please?

As for progress from our side, we haven't had a chance to look into this specific issue yet, but we have been working on some prerequisites (e.g. dulwich). If you could post new results with the recent dvc version, that would be really helpful.

@alealv

alealv commented Jul 22, 2021

This is the dvc doctor output

➜ dvc doctor
DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-71-generic-x86_64-with-glibc2.29
Supports:
        http (requests = 2.25.1),
        https (requests = 2.25.1)
Cache types: symlink
Cache directory: nfs4 on 10.15.0.20:/volume1/MLData
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Repo: dvc (subdir), git

@efiop
Contributor

efiop commented Jul 22, 2021

@alealv Do you get the same results with this latest version as you did a few months ago?

@alealv

alealv commented Jul 22, 2021

I haven't run the same tests, but I can say that I ran dvc status (for the whole 5 TB dataset) for more than 3 hours and it didn't finish. Also, dvc commit ran for more than 5 days and I stopped it because I didn't know whether it was stalled or not.

It would be best if we coordinate on which tests to perform or create a specific repo to reproduce this.

@efiop
Contributor

efiop commented Jul 22, 2021

@alealv If you could add --cprofile-dump dump.prof to any of those commands and share that file here - that would be enough for us to take a closer look.

@alealv

alealv commented Aug 25, 2021

Hi there! After almost one month the process finished.

dump.prof.gz

@efiop
Contributor

efiop commented Aug 25, 2021

@alealv Whoa, that's crazy! Thank you so much for the profiling dump! 🙏 I feel really bad for not clarifying that you could've tried this with a smaller dataset, thus wasting so much of your time 🙁

It finished successfully, right? And you were using 2.5.4? Just clarifying

What I can see from the dump right away is that stat is around 100 times slower than on my machine, which is usually because of NFS, and in your case it probably is 🙁 But we need to take a closer look.

The good news is that since last time we've reworked our object-related logic and our benchmarking framework, where we will be adding a dataset of a similar size to yours (from the logs, it looks like your dataset is ~1.7 million files, right?) to keep an eye on this. We are also working on some optimizations for status and diff right now, and we have an NFS-related epic coming up in September.
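As a side note, a quick (hypothetical) way to sanity-check per-file stat latency on a particular mount, if anyone wants to compare NFS vs local numbers (the directory path is a placeholder):

import os
import time

root = "/mnt/data/dvc/tmp/data/tedlium_waves"  # placeholder: any directory on the mount in question
paths = [entry.path for entry in os.scandir(root)]
start = time.perf_counter()
for path in paths:
    os.stat(path)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(paths) * 1e6:.1f} us per stat over {len(paths)} entries")

On a local disk with a warm cache this is typically a few microseconds; over NFS it can be orders of magnitude higher.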

@alealv

alealv commented Aug 26, 2021

It finished successfully, right? And you were using 2.5.4? Just clarifying

Indeed

What I can see from the dump right away is that stat is around 100 times slower than on my machine, which is usually because of nfs and in your case it probably is But we need to take a closer look.

Yes, the cache is located on an NFS.

(from logs, it looks like your dataset is ~1.7 million of files, right?)

Actually, it is 12,601,130

Good news here is that since the last time...

Cool, that's great.

@alealv

alealv commented Dec 2, 2021

Hi, me again!

I've been using DVC for a new repository and I noticed that the same dependencies are checked each time they appear in a stage. I would expect them to be checked only once, assuming their state hasn't changed for the following stages.

The only important remark is that I've set cache: false on the data dependencies.

Here is a dump of the verbose output of dvc status and also the profiling result
https://we.tl/t-ZjgFk3KBdo

➜ dvc --version
2.8.3

@alealv

alealv commented Mar 3, 2022

So, this week I worked on this issue again and tried to explore the DVC code in more depth.

I saw that there have been some improvements to the code and that you are now using fsspec. I also see that there is still a ThreadPoolExecutor; AFAIK this is to allow asynchronous reads on multiple files and avoid being blocked by the read operation.

I was wondering if this could be improved by using cooperative multitasking instead. I've read that fsspec has an async option for some filesystems, mainly LocalFS, MemoryFS, and HTTPFS.

Furthermore, this could also scale with the available cores in the machine: one producer running the os.walk and at least one consumer computing the hash value (more on this in this interesting talk). A rough sketch is below.

PS: For the local file system there is this interesting asyncio library. Also this one, but it doesn't seem to be as popular.
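Here is a rough sketch of the producer/consumer idea with asyncio (hypothetical code, not DVC's implementation; the path and worker count are placeholders, and the blocking read + hash is offloaded to the default thread executor):

import asyncio
import hashlib
import os

def md5sum(path):
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

async def producer(queue, root):
    # note: os.walk itself is still synchronous here; the coroutine only
    # yields to the loop when the queue is full
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            await queue.put(os.path.join(dirpath, name))
    await queue.put(None)  # sentinel: no more work

async def consumer(queue, loop):
    while True:
        path = await queue.get()
        if path is None:
            await queue.put(None)  # let the other consumers see the sentinel too
            return
        # the read + hash is blocking, so run it in the default thread executor
        path, md5 = await loop.run_in_executor(None, md5sum, path)
        print(md5, path)

async def main(root, workers=8):
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue(maxsize=1024)  # bound memory for very large trees
    await asyncio.gather(
        producer(queue, root),
        *(consumer(queue, loop) for _ in range(workers)),
    )

if __name__ == "__main__":
    asyncio.run(main("/mnt/data/dvc/tmp/data/tedlium_waves"))  # placeholder path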

@dtrifiro
Contributor

dtrifiro commented Mar 4, 2022

Hi @alealv, thanks for the feedback! We're researching performance improvements using asyncio; I'll let you know once we have updates.

@karajan1001
Contributor

(quoting @alealv's suggestion above about cooperative multitasking)

In my past experience, async is usually a little bit slower than multiprocessing.

@daavoo daavoo removed the p1-important Important, aka current backlog of things to do label Dec 20, 2022
@mattseddon mattseddon closed this as not planned Won't fix, can't repro, duplicate, stale Mar 4, 2024