status: it's slow #5594

Closed
jorgeorpinel opened this issue Mar 11, 2021 · 21 comments
Labels
A: status (related to the dvc diff/list/status commands), performance (improvement over resource / time consuming tasks)

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 11, 2021

Bug Report

Description

status seems slow.

Reproduce

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc pull
$ dvc status
Cloning

It lingers there for a few seconds before changing to "Data and pipelines are up to date".

Before 2.0 this was pretty much instantaneous. A few users have reported this on Discord, BTW (two in #q-and-a earlier today; one of them mentioned add being slow too). @efiop mentioned it could be related to the new Dulwich implementation.

Expected

Instantaneous report, especially for such a simple project as in the example above.

Environment information

Output of dvc doctor:

$ dvc version
DVC version: 2.0.5
---------------------------------
Platform: Python 3.6.9 on Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

@jorgeorpinel jorgeorpinel added the performance improvement over resource / time consuming tasks label Mar 11, 2021
@pmrowla
Contributor

pmrowla commented Mar 11, 2021

Based on the discord discussion it's probably because of #5544

We can revert the status check change for now; I'll look into whether using pygit2's status implementation is faster.
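For context, a rough sketch of what a dirty-check via pygit2 could look like (a hypothetical standalone snippet, not DVC's actual implementation; the repo path is a placeholder):

import pygit2

repo = pygit2.Repository(".")  # placeholder: path to any git repository
# status() maps each path that differs from HEAD/index to a bitmask of
# GIT_STATUS_* flags; an empty result means the worktree is clean
changed = {
    path: flags
    for path, flags in repo.status().items()
    if not flags & pygit2.GIT_STATUS_IGNORED
}
print("dirty" if changed else "clean")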

@pmrowla pmrowla self-assigned this Mar 11, 2021
@pmrowla pmrowla added the p0-critical Critical issue. Needs to be fixed ASAP. label Mar 11, 2021
@jorgeorpinel
Contributor Author

Is that done during Cloning? Does it affect add too? Thanks

@pmrowla
Contributor

pmrowla commented Mar 11, 2021

It's not related to cloning. It's pretty much done whenever we load stages, and yes, it's also what is affecting the add performance. Or at least, the issue that should be resolved by the linked PR is definitely the cause of the add and repro performance issues. There may be an unrelated status issue as well, though?

@pmrowla
Contributor

pmrowla commented Mar 11, 2021

Ok, so it looks like the status issue is actually separate. It's still a dulwich/git issue, but related to gitignore performance.

dvc status in the DVC git repo:

[screenshot: cProfile output of dvc status in the DVC git repo]

The add/repro performance issue should maybe go into its own GH issue, as it's related to the git status/git-track reminder (although gitignore performance probably affects those commands as well).
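For anyone who wants to poke at the gitignore part in isolation, here is a minimal standalone snippet using dulwich's ignore machinery (a hypothetical example, not DVC code; the repo and file paths are placeholders):

from dulwich.ignore import IgnoreFilterManager
from dulwich.repo import Repo

repo = Repo(".")  # placeholder: a repo with several .gitignore patterns
manager = IgnoreFilterManager.from_repo(repo)
# is_ignored() evaluates the .gitignore filters relevant to the path,
# which is the work that dominates the cProfile output above
print(manager.is_ignored("data/raw/file.txt"))  # placeholder path

Running this in a loop over many paths should reproduce the slowdown independently of DVC.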

@pmrowla pmrowla added p1-important Important, aka current backlog of things to do and removed p0-critical Critical issue. Needs to be fixed ASAP. labels Mar 11, 2021
@pmrowla pmrowla removed their assignment Mar 11, 2021
@skshetry
Member

@pmrowla, is that happening on an external repo? It should only be in effect for the local repository.

@skshetry
Member

For me, it's quite fast for example-get-started.
[screenshot: dvc status timing on example-get-started]

@pmrowla
Contributor

pmrowla commented Mar 11, 2021

@skshetry the gitignore cProfile output that I posted is from running dvc status inside the DVC git repo (which has several gitignore patterns to match).

In example-get-started (which has almost none) it is also fast for me.

Using the 1.11 branch, dvc status takes ~1/3 the time to run in the DVC git repo for me:

[screenshot: cProfile output of dvc status on the 1.11 branch]

@alealv

alealv commented Apr 20, 2021

👋

I'm working with a huge dataset (5 TB), and running the status or commit command is becoming a bit annoying. I would expect them to take a long time, but they take ages (I mean days to process), and before computing hashes they don't give any output whatsoever, so I have no way of knowing whether the process is dead or not.

So I decided to check if there is any way of improving the code and I found this thread.

Profiling

I've profiled the status command as done above but only for two different stages of my pipeline:

  1. The first one processes ~106 GB of data (~292,370 files). It uses the LibriSpeech ASR dataset.
    [profiling screenshots: profiling_librispeech_1, profiling_librispeech_2]

  2. The second one processes ~50 GB of data (~268,270 files). It uses the TED-LIUM dataset.
    [profiling screenshots: profiling_1, profiling_2]

So I noticed two things:

  1. Most of the time is spent in a deadlock (code section)
  2. The other big chunk is because of os.walk (code section)

Comparing times with Bash

I've done the comparison only for the TED-LIUM dataset (second process)

  • List files:
aalvarez@ml1:/mnt/data/dvc/tmp/data/tedlium_waves$ time ls . > /dev/null

real    0m2,656s
user    0m0,607s
sys     0m0,350s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f > /dev/null

real    0m26,133s
user    0m0,222s
sys     0m3,540s

So, compared to DVC's use of os.walk, this takes significantly less time.
DVC --> 538s / 60 = ~8.97 min

  • Computing hashes
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f -exec md5sum {} \; > /dev/null

real    12m4,800s
user    5m2,927s
sys     3m47,566s

aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f | parallel --progress md5sum > /dev/null

Computers / CPU cores / Max jobs to run
1:local / 36 / 36

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/268263/100%/0.0s

real    7m38,746s
user    10m26,780s
sys     9m43,082s

And hash computation is more or less the same depending on which number you take, but still slower than what can be accomplished:
DVC --> 660s / 60 = 11 min

Proposition

  1. True parallelism would come from using a multiprocessing.Pool instead of a ThreadPoolExecutor; I also believe it would be best to use imap_unordered (see the sketch after this list).
  2. Maybe this is going too deep, but looking at the os.walk code, it first appends everything to a list and then returns. Wouldn't it be better to use scandir() and yield?
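To illustrate both points, here is a minimal sketch of what I have in mind (a lazy scandir-based walk feeding multiprocessing.Pool.imap_unordered; the root path and chunksize are placeholders, and this is not DVC's actual code):

import hashlib
import os
from multiprocessing import Pool

def iter_files(root):
    # lazily yield file paths with os.scandir instead of building a full list
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path

def md5sum(path):
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

if __name__ == "__main__":
    root = "/mnt/data/dvc/tmp/data/tedlium_waves"  # placeholder path from the timings above
    with Pool() as pool:
        # imap_unordered yields each result as soon as a worker finishes it
        for path, md5 in pool.imap_unordered(md5sum, iter_files(root), chunksize=64):
            print(md5, path)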

@alealv

alealv commented Jul 21, 2021

Hi, are there any updates on this?

@efiop
Contributor

efiop commented Jul 21, 2021

@alealv Could you show dvc doctor too, please?

As for progress from our side, we haven't had a chance to look into this specific issue yet, but we have been working on some prerequisites (e.g. dulwich). If you could post new results with the recent dvc version, that would be really helpful.

@alealv

alealv commented Jul 22, 2021

This is the dvc doctor output

➜ dvc doctor
DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.5 on Linux-5.4.0-71-generic-x86_64-with-glibc2.29
Supports:
        http (requests = 2.25.1),
        https (requests = 2.25.1)
Cache types: symlink
Cache directory: nfs4 on 10.15.0.20:/volume1/MLData
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-ubuntu--lv
Repo: dvc (subdir), git

@efiop
Contributor

efiop commented Jul 22, 2021

@alealv Do you get the same results with this latest version as you did a few months ago?

@alealv

alealv commented Jul 22, 2021

I haven't run the same tests, but I can say that I ran dvc status (for the whole 5 TB dataset) for more than 3 hours and it didn't finish. Also, dvc commit ran for more than 5 days and I stopped it because I didn't know whether it was stalled or not.

It would be best if we coordinate on which tests to perform or create a specific repo to reproduce this.

@efiop
Contributor

efiop commented Jul 22, 2021

@alealv If you could add --cprofile-dump dump.prof to any of those commands and share that file here - that would be enough for us to take a closer look.

@alealv

alealv commented Aug 25, 2021

Hi there! After almost one month the process finished.

dump.prof.gz

@efiop
Contributor

efiop commented Aug 25, 2021

@alealv Whoa, that's crazy! Thank you so much for the profiling dump! 🙏 I feel really bad for not clarifying that you could've tried this with a smaller dataset, thus wasting so much of your time 🙁

It finished successfully, right? And you were using 2.5.4? Just clarifying

What I can see from the dump right away is that stat is around 100 times slower than on my machine, which is usually because of NFS, and in your case it probably is 🙁 But we need to take a closer look.

The good news is that since last time we've reworked our object-related logic and our benchmarking framework, where we will be adding a dataset of a similar size to yours (from the logs, it looks like your dataset is ~1.7 million files, right?) to keep an eye on this. We are also working on some optimizations for status and diff right now, and we have an NFS-related epic coming up in September.
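As a side note, a quick (hypothetical) way to sanity-check per-file stat latency on a particular mount, if anyone wants to compare NFS vs local numbers (the directory path is a placeholder):

import os
import time

root = "/mnt/data/dvc/tmp/data/tedlium_waves"  # placeholder: any directory on the mount in question
paths = [entry.path for entry in os.scandir(root)]
start = time.perf_counter()
for path in paths:
    os.stat(path)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(paths) * 1e6:.1f} us per stat over {len(paths)} entries")

On a local disk with a warm cache this is typically a few microseconds; over NFS it can be orders of magnitude higher.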

@alealv

alealv commented Aug 26, 2021

It finished successfully, right? And you were using 2.5.4? Just clarifying

Indeed

What I can see from the dump right away is that stat is around 100 times slower than on my machine, which is usually because of nfs and in your case it probably is But we need to take a closer look.

Yes, the cache is located on an NFS.

(from logs, it looks like your dataset is ~1.7 million of files, right?)

Actually, it is 12,601,130

Good news here is that since the last time...

Cool, that's great.

@alealv

alealv commented Dec 2, 2021

Hi, me again!

I've been using DVC for a new repository and I noticed that the same dependencies are checked each time they appear in a stage. I would expect them to be checked only once, assuming their state hasn't changed for the following stages.

The only important remark is that I've set cache: false on the data dependencies.

Here is a dump of the verbose output of dvc status and also the profiling result
https://we.tl/t-ZjgFk3KBdo

➜ dvc --version
2.8.3

@alealv

alealv commented Mar 3, 2022

So, this week I worked on this issue again and tried to explore the DVC code in more depth.

I saw that there have been some improvements to the code and that you are now using fsspec. I also see that there is still a ThreadPoolExecutor; AFAIK this is to allow asynchronous reads on multiple files and avoid being blocked by the read operation.

I was wondering if this could be improved by using cooperative multitasking instead. I've read that fsspec has an async option for some filesystems, mainly LocalFS, MemoryFS, and HTTPFS.

Furthermore, this could also scale with the available cores in the machine: one producer running the os.walk and at least one consumer computing the hash value (more on this in this interesting talk). A rough sketch is below.

PS: For the local file system there is this interesting asyncio library. Also this one, but it doesn't seem to be as popular.
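Here is a rough sketch of the producer/consumer idea with asyncio (hypothetical code, not DVC's implementation; the path and worker count are placeholders, and the blocking read + hash is offloaded to the default thread executor):

import asyncio
import hashlib
import os

def md5sum(path):
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return path, digest.hexdigest()

async def producer(queue, root):
    # note: os.walk itself is still synchronous here; the coroutine only
    # yields to the loop when the queue is full
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            await queue.put(os.path.join(dirpath, name))
    await queue.put(None)  # sentinel: no more work

async def consumer(queue, loop):
    while True:
        path = await queue.get()
        if path is None:
            await queue.put(None)  # let the other consumers see the sentinel too
            return
        # the read + hash is blocking, so run it in the default thread executor
        path, md5 = await loop.run_in_executor(None, md5sum, path)
        print(md5, path)

async def main(root, workers=8):
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue(maxsize=1024)  # bound memory for very large trees
    await asyncio.gather(
        producer(queue, root),
        *(consumer(queue, loop) for _ in range(workers)),
    )

if __name__ == "__main__":
    asyncio.run(main("/mnt/data/dvc/tmp/data/tedlium_waves"))  # placeholder path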

@dtrifiro
Contributor

dtrifiro commented Mar 4, 2022

Hi @alealv, thanks for the feedback! We're researching performance improvements using asyncio; I'll let you know once we have updates.

@karajan1001
Contributor

(quoting @alealv's suggestion above about cooperative multitasking)

In my past experience, async is usually a little bit slower than multiprocessing.

@daavoo daavoo removed the p1-important Important, aka current backlog of things to do label Dec 20, 2022
@mattseddon mattseddon closed this as not planned Won't fix, can't repro, duplicate, stale Mar 4, 2024