status: it's slow #5594
Comments
Based on the Discord discussion, it's probably because of #5544. We can revert the status check change for now; I'll look into whether using pygit2's status implementation is faster. |
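For anyone who wants to try the backend comparison locally, here is a rough timing sketch. It assumes dulwich and pygit2 are both installed and it is run from inside a git work tree; it is an illustration only, not DVC's internal code:

```python
# Rough timing sketch comparing git-status backends. Assumes dulwich and
# pygit2 are installed and that this is run from inside a git work tree;
# an illustration only, not DVC's internal code.
import time

import pygit2
from dulwich import porcelain


def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")


# dulwich: pure-Python status (staged / unstaged / untracked)
timed("dulwich porcelain.status", lambda: porcelain.status("."))

# pygit2: libgit2-backed status, returns {path: status_flags}
repo_path = pygit2.discover_repository(".")
timed("pygit2 Repository.status", lambda: pygit2.Repository(repo_path).status())
```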
Is that done during |
It's not related to cloning. It's pretty much done whenever we load stages, and yes, it's what is also affecting the |
Ok, so it looks like the status issue is actually separate. It's still a dulwich/git issue, but related to gitignore performance.
The |
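To check whether gitignore matching really is the hot path, one option is to time dulwich's ignore matcher directly over a large tree. A minimal sketch, assuming dulwich is installed and the current directory is the repository root (this is not the profile posted above):

```python
# Rough sketch for timing dulwich's gitignore matching over a large tree.
# Assumes dulwich is installed and "." is the root of the git repository.
import os
import time

from dulwich.ignore import IgnoreFilterManager
from dulwich.repo import Repo

manager = IgnoreFilterManager.from_repo(Repo("."))

paths = []
for root, _dirs, files in os.walk("."):
    for name in files:
        # dulwich expects repo-relative, slash-separated paths
        rel = os.path.relpath(os.path.join(root, name))
        paths.append(rel.replace(os.sep, "/"))

start = time.perf_counter()
ignored = sum(1 for p in paths if manager.is_ignored(p))
elapsed = time.perf_counter() - start
print(f"checked {len(paths)} paths, {ignored} ignored, in {elapsed:.2f}s")
```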
@pmrowla, is that happening on an external repo? It should only be in effect for local repositories. |
@skshetry the gitignore cProfile stuff that I posted is from running
In example-get-started (which has almost none) it is also fast for me. Using the |
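For anyone who wants to capture a similar profile, the standard library's cProfile is enough. The sketch below assumes the DVC 2.x-era entry point `dvc.main.main`; adjust the import for other versions:

```python
# Sketch for capturing a cProfile dump of `dvc status` in-process.
# Assumes a DVC 2.x-era layout where the CLI entry point is dvc.main.main;
# newer releases may expose it under a different module.
import cProfile
import pstats

from dvc.main import main  # assumption: DVC 2.x entry point

profiler = cProfile.Profile()
profiler.enable()
main(["status"])           # equivalent to running `dvc status`
profiler.disable()

profiler.dump_stats("dvc-status.prof")   # viewable with snakeviz, gprof2dot, ...
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)
```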
👋 I'm working with a huge dataset (5TB) and running
So I decided to check if there is any way of improving the code, and I found this thread.

Profiling

I've profiled the
So I noticed two things:
Comparing times with Bash

I've done the comparison only for the TED-LIUM dataset (second process):
aalvarez@ml1:/mnt/data/dvc/tmp/data/tedlium_waves$ time ls . > /dev/null
real 0m2,656s
user 0m0,607s
sys 0m0,350s
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f > /dev/null
real 0m26,133s
user 0m0,222s
sys 0m3,540s

So compared to os.walk, this takes significantly less time.
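For a comparable os.walk figure on the same directory, a tiny timing snippet like this can be used (the path is the one from the shell commands above):

```python
# Tiny timing helper to get a comparable os.walk figure for the same
# directory used in the shell timings above.
import os
import time

ROOT = "/mnt/data/dvc/tmp/data/tedlium_waves"

start = time.perf_counter()
count = sum(len(files) for _root, _dirs, files in os.walk(ROOT))
print(f"os.walk saw {count} files in {time.perf_counter() - start:.2f}s")
```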
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f -exec md5sum {} \; > /dev/null
real 12m4,800s
user 5m2,927s
sys 3m47,566s
aalvarez@ml1:~$ time find /mnt/data/dvc/tmp/data/tedlium_waves -type f | parallel --progress md5sum > /dev/null
Computers / CPU cores / Max jobs to run
1:local / 36 / 36
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/268263/100%/0.0s
real 7m38,746s
user 10m26,780s
sys 9m43,082s

And hash computation takes more or less the same time depending on which number you take, but is still slower than what can be accomplished (see the sketch after this comment).

Proposition
|
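Purely as an illustration of the `find ... -type f | parallel md5sum` experiment above (not the proposition itself, and not DVC's implementation), the same parallel hashing can be sketched in Python with a thread pool; the path and worker count mirror the logs above:

```python
# Illustrative Python counterpart of `find ... -type f | parallel md5sum`:
# hash files concurrently with a thread pool. Path and worker count are
# taken from the timings above; this is not DVC's implementation.
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

ROOT = "/mnt/data/dvc/tmp/data/tedlium_waves"


def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


paths = (os.path.join(root, name)
         for root, _dirs, files in os.walk(ROOT)
         for name in files)

with ThreadPoolExecutor(max_workers=36) as pool:   # 36 cores, as in the log
    results = dict(pool.map(md5_of, paths))

print(f"hashed {len(results)} files")
```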
Hi, are there any updates on this? |
@alealv Could you show
As to progress from our side, we haven't had a chance to look into this specific issue yet, but we have been working on some prerequisites (e.g. dulwich). If you could post new results with the recent dvc version, that would be really helpful. |
This is the
|
@alealv Do you get the same results with this latest version as you did a few months ago? |
I haven't run the same tests, but I can say that I ran
It would be best if we coordinate on which tests to perform or create a specific repo to reproduce this. |
@alealv If you could add |
Hi there! After almost one month the process finished. |
@alealv Whoa, that's crazy! Thank you so much for the profiling dump! 🙏 I feel really bad for not clarifying that you could've tried this with a smaller dataset, and thus wasting so much of your time 🙁 It finished successfully, right? And you were using 2.5.4? Just clarifying.

What I can see from the dump right away is that
Good news here is that since the last time, we've reworked our object-related logic and we've reworked our benchmarking framework, where we will be adding a dataset size similar to yours (from the logs, it looks like your dataset is ~1.7 million files, right?) to keep an eye on this. We are also working on some optimizations right now for |
Indeed
Yes, the cache is located on an NFS.
Actually, it is 12,601,130
Cool, that's great. |
Hi, me again! I've been using DVC for a new repository and I noticed that the same dependencies are checked each time they appear in a stage. I would expect them to be checked only once, assuming the state hasn't changed for the following stages. The only important remark is that I've settled
Here is a dump of the verbose output of
|
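On the repeated dependency checks mentioned above: the expectation amounts to caching the expensive check once per run, keyed on cheap stat data so that any modification invalidates it. A hedged sketch of that idea only, not DVC's actual state logic, and `dependency_checksum` is a hypothetical helper name:

```python
# Hedged sketch of "check each dependency only once per run": cache the
# expensive check, keyed on cheap stat info so any modification to the
# file invalidates the cached entry. Not DVC's actual state logic;
# dependency_checksum is a hypothetical helper.
import functools
import os
import zlib


@functools.lru_cache(maxsize=None)
def _checksum_for_state(path, mtime_ns, size):
    # The expensive part: read and checksum the whole file.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc


def dependency_checksum(path):
    st = os.stat(path)
    # Same (path, mtime, size) -> cached result; otherwise recompute.
    return _checksum_for_state(path, st.st_mtime_ns, st.st_size)
```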
So, this week I worked on this issue again and tried to explore the DVC code in more depth. I saw that there have been some improvements to the code and that you are now using fsspec. Also, I see that there is still a ThreadPoolExecutor. AFAIK, this is to allow asynchronous reads on multiple files and avoid being blocked by the read operation. I was wondering if this could be improved by using cooperative multitasking instead; a sketch of that idea is below. I've read that fsspec has an async option for some filesystems, mainly LocalFS, MemoryFS, and HTTPFS. Furthermore, this could also scale with the available cores in the machine, one producer
PS: For the local file system there is this interesting asyncio library. Also this one, but it doesn't seem to be so popular. |
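As a rough sketch of the cooperative-multitasking idea (standard library only, not based on DVC's or fsspec's internals; the path and concurrency limit are placeholders, and Python 3.9+ is assumed for asyncio.to_thread):

```python
# Rough sketch of hashing many files with asyncio coordinating the work.
# Blocking file I/O still runs in worker threads (asyncio.to_thread), but
# scheduling, bounding concurrency and collecting results is cooperative.
import asyncio
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


async def hash_tree(root, max_concurrency=32):
    sem = asyncio.Semaphore(max_concurrency)

    async def one(path):
        async with sem:
            # the event loop only coordinates; the blocking read/hash
            # happens in a thread
            return await asyncio.to_thread(md5_of, path)

    tasks = [asyncio.create_task(one(os.path.join(r, name)))
             for r, _dirs, files in os.walk(root)
             for name in files]
    return dict(await asyncio.gather(*tasks))


if __name__ == "__main__":
    hashes = asyncio.run(hash_tree("/mnt/data/dvc/tmp/data/tedlium_waves"))
    print(f"hashed {len(hashes)} files")
```

For a tree with millions of files one would stream the tasks through a bounded queue rather than creating them all upfront, but the scheduling idea is the same.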
Hi @alealv, thanks for the feedback! We're researching performance improvements using
For the past experience |
Bug Report

Description

status seems slow.

Reproduce

It lingers there for a few seconds, before changing to

Data and pipelines are up to date.

Before 2.0 this was pretty much instantaneous. A few users have reported this on Discord BTW (2 in #q-and-a earlier today, one of them mentioned add being slow too). @efiop mentioned it could be related to the new Dulwich implementation.

Expected

Instantaneous report, esp. for such a simple project as in the example above.

Environment information

Output of dvc doctor:

Additional Information (if any):