Speed up dvc status for large projects #3280
Comments
I think we should try to reproduce such a big repo and see what takes so much time. I suspect db access. |
Reading through the Discord dialog and looking at the log, I don't think this is about a big repo. This is probably about many imported files, which create many clones on |
@Suor The number of files versioned by dvc is around 9k from May I know if there's any way to improve it? |
This is obviously an issue on our end. I will think about how this may be sped up. |
@Ykid Can you say how many different sources you use in those 64 import stages? A source is a pair of |
@Suor There is one url as it is our data registry. I follow
|
@Ykid thanks. |
A note from Discord: the git repo is big, both in history and in checked-out files:
The last change only caches a single instance of the repo, not all of them, which prevents us from needing new I am trying to untangle it now. |
May I know what this means? |
So we have 3 things cached now separately:
- clean clones, to not ask for creds repeatedly
- cache dirs, also shared between erepos with same origin
- checked out clones if they are read only, addressed by (url, hexsha)

Several additions to `Git` along the way:
- Git.is_sha() static method
- .pull() and .push() work with multiple returned records correctly
- .get_rev() and .resolve_rev() work faster
- .resolve_rev() looks for remote branches
- .has_rev()

Fixes iterative#3280.
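To illustrate the third kind of cache, here is a minimal sketch of keying read-only checkouts by (url, hexsha). This is not DVC's actual code: `ReadOnlyCheckoutCache`, `fake_checkout`, and the registry URL are hypothetical names made up for this example, and the expensive clone-and-checkout step is replaced by an injected function so the snippet runs without git.

```python
from typing import Callable, Dict, List, Tuple


class ReadOnlyCheckoutCache:
    """Hypothetical cache of read-only checkouts, keyed by (url, hexsha)."""

    def __init__(self, make_checkout: Callable[[str, str], str]) -> None:
        # make_checkout(url, hexsha) -> path stands in for the expensive
        # clone-and-checkout step; it is injected to keep the sketch runnable.
        self._make_checkout = make_checkout
        self._paths: Dict[Tuple[str, str], str] = {}

    def get(self, url: str, hexsha: str) -> str:
        key = (url, hexsha)
        # Only do the expensive work the first time this exact revision is seen.
        if key not in self._paths:
            self._paths[key] = self._make_checkout(url, hexsha)
        return self._paths[key]


calls: List[Tuple[str, str]] = []


def fake_checkout(url: str, hexsha: str) -> str:
    # Records each "real" checkout and returns a made-up path.
    calls.append((url, hexsha))
    return f"/tmp/erepo-{len(calls)}"


cache = ReadOnlyCheckoutCache(fake_checkout)
registry = "https://example.com/data-registry.git"  # hypothetical URL
# 64 import stages pointing at the same revision of the same registry.
paths = [cache.get(registry, "a" * 40) for _ in range(64)]
print(len(calls), "checkout(s) performed")
```

With 64 lookups of the same (url, hexsha) pair, the injected checkout function is called once and every lookup gets the same path back. A different revision of the same URL would get its own entry.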
Sorry for the slow response. This means that when you have many imports from the same repo
Anyway, this should be fixed after #3286 lands. You can try it right now with:
pip install git+https://github.com/Suor/dvc.git@erepo-ro
If you do, can you please tell me how well it works for you? |
* erepo: cache all read only external repos by hexsha

So we have 3 things cached now separately:
- clean clones, to not ask for creds repeatedly
- cache dirs, also shared between erepos with same origin
- checked out clones if they are read only, addressed by (url, hexsha)

Several additions to `Git` along the way:
- Git.is_sha() static method
- .pull() and .push() work with multiple returned records correctly
- .get_rev() and .resolve_rev() work faster
- .resolve_rev() looks for remote branches
- .has_rev()

Fixes #3280.

* git: improve .resolve_rev()

It follows `git checkout` logic now: if a name can be unambiguously resolved across known remotes, then it's done.
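The "unambiguous across known remotes" rule can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the commit: the `is_sha` and `resolve_rev` functions below are standalone stand-ins, the 4 to 40 hex character pattern is an assumed definition of "looks like a sha", and refs are modelled as plain dicts instead of a real git repository.

```python
import re
from typing import Dict, Optional

# Assumption: a (possibly abbreviated) sha is 4 to 40 hex characters.
_SHA_RE = re.compile(r"[0-9a-fA-F]{4,40}")


def is_sha(rev: str) -> bool:
    """Return True if rev looks like a full or abbreviated commit sha."""
    return _SHA_RE.fullmatch(rev) is not None


def resolve_rev(
    name: str,
    local_refs: Dict[str, str],
    remote_refs: Dict[str, Dict[str, str]],
) -> Optional[str]:
    """Resolve a branch name to a sha, or return None if that is not possible.

    local_refs maps branch name -> sha; remote_refs maps remote name ->
    {branch name -> sha}. Both structures are hypothetical stand-ins.
    """
    if name in local_refs:
        return local_refs[name]
    # Fall back to remotes only when exactly one remote knows this branch.
    candidates = [refs[name] for refs in remote_refs.values() if name in refs]
    if len(candidates) == 1:
        return candidates[0]
    return None


remotes = {
    "origin": {"master": "1" * 40, "feature": "2" * 40},
    "fork": {"master": "3" * 40},
}
print(resolve_rev("feature", {}, remotes) == "2" * 40)  # only origin has it
print(resolve_rev("master", {}, remotes))  # two remotes have it: ambiguous
```

Here `feature` exists on one remote only, so it resolves, while `master` exists on both remotes and is left unresolved.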
DVC version: 0.82.9+f73900
ERROR: failed to obtain data status - 'Git' object has no attribute 'is_known'
|
Reopening and escalating the priority. @Suor Please take a look ASAP, we need to release a new version with the fix ASAP as well. |
Looks like |
Handled in #3323. |
@Ykid 0.85.0 is out on pip and conda, please upgrade, give it a try and let us know if it fixed the issue for you. Thanks for the feedback! 🙂 |
The bug related to git is fixed, but performance does not seem to have improved much. :(
|
@Suor Please take a look. |
So the optimization works for me, no unneeded clones or copies done. |
@Ykid I made a branch with more erepo logging. Can you try it to see what is actually happening on your side and how much time it takes?
pip install git+https://github.com/Suor/dvc.git@erepo-log
dvc status -v
# And paste output here |
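For anyone who wants a wall-clock number to go with the verbose log, a small timing wrapper could look like the sketch below. `time_command` is a hypothetical helper written for this example, not part of DVC, and it times a trivial Python subprocess so it can run anywhere. Inside a DVC project the argument list would be swapped for `["dvc", "status", "-v"]`.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def time_command(argv: List[str]) -> Tuple[float, int]:
    """Run argv and return (elapsed seconds, exit code)."""
    start = time.perf_counter()
    # Output is discarded here; drop the DEVNULL arguments to see the log.
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, proc.returncode


# Placeholder command so the sketch is runnable without DVC installed.
# In a DVC project, use ["dvc", "status", "-v"] instead.
elapsed, code = time_command([sys.executable, "-c", "pass"])
print(f"exit code {code}, took {elapsed:.2f}s")
```

Running it before and after an upgrade gives two comparable numbers for the same project.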
Closing due to inactivity. |
Please provide information about your setup
DVC version (i.e. `dvc --version`), platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB (Linux), RPM (Linux))
Expect `dvc status` can return in a shorter time; currently it takes 30s.
A few stats
log.txt