import/update: cache git repos/clones #3496

Open
casperdcl opened this issue Mar 17, 2020 · 7 comments
Labels: enhancement (Enhances DVC), help wanted, p1-important (Important, aka current backlog of things to do), performance (improvement over resource / time consuming tasks)

Comments

@casperdcl
Contributor

casperdcl commented Mar 17, 2020

dvc import https://some/git/repo/ some_file
dvc update  # should not re-clone, should only pull into existing cache
@casperdcl added the p2-medium, performance, and enhancement labels on Mar 17, 2020
@casperdcl self-assigned this on Apr 2, 2020
@Suor
Contributor

Suor commented Apr 3, 2020

The thing is, the cache of clones is not persisted between dvc runs. If we make it persistent, then dvc update won't re-clone; it will only do a git pull.

@casperdcl
Contributor Author

yes; this is about making it persistent & pulling rather than re-cloning.
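
For illustration, the persist-and-pull behaviour discussed above could look roughly like the sketch below. This is not DVC's implementation: the cache layout (one hashed directory per URL), the function names, and the `cache_root` parameter are all assumptions made for the example.

```python
import hashlib
import os
import subprocess


def clone_dir_for(url, cache_root):
    """Map a repo URL to a stable directory under cache_root (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_root, digest)


def sync_command(url, cache_root):
    """Pick the git command that brings the cached clone up to date.

    First use: clone. Later uses: only fetch into the existing clone.
    """
    path = clone_dir_for(url, cache_root)
    if os.path.isdir(os.path.join(path, ".git")):
        return ["git", "-C", path, "fetch", "--tags", "origin"]
    return ["git", "clone", url, path]


def sync(url, cache_root):
    """Run the chosen command (needs git on PATH and, usually, network access)."""
    os.makedirs(cache_root, exist_ok=True)
    subprocess.run(sync_command(url, cache_root), check=True)
    return clone_dir_for(url, cache_root)
```

Keeping the decision (`sync_command`) separate from the side effect (`sync`) makes it easy to check the clone-versus-fetch logic without touching the network.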

@jorgeorpinel
Contributor

What about a repo cache at the user level? Could be a system config var so you can disable it, like analytics.
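
A rough sketch of what a user-level, opt-out repo cache location might look like. Everything named here is hypothetical: `EXAMPLE_REPO_CACHE` is not a DVC setting, and `example-tool` is a placeholder directory name. Only `XDG_CACHE_HOME` is an existing convention (from the XDG Base Directory specification).

```python
import os


def repo_cache_dir(env=None):
    """Return the per-user directory for cached clones, or None if disabled.

    EXAMPLE_REPO_CACHE is a made-up switch used only for this illustration.
    """
    env = os.environ if env is None else env
    if env.get("EXAMPLE_REPO_CACHE", "1").lower() in ("0", "false", "no", "off"):
        return None  # caching switched off by the user
    base = env.get("XDG_CACHE_HOME") or os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(base, "example-tool", "repos")
```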

Context: #4203

@casperdcl
Contributor Author

In light of #4246 being merged, I'm going to downgrade the priority here...

@casperdcl added the p3-nice-to-have label and removed the p2-medium label on Jul 19, 2021
@efiop changed the title from "update: cache git repos" to "import/update: cache git repos/clones" on Dec 8, 2023
@johnyaku

Persistent clones (as per #10511) are different from shallow clones (as per #4246).
Both speed up cloning (or potentially avoid it), but only persistent clones would allow us to work with imported data without internet connectivity, which is necessary for us on an HPC where most queues have no connectivity.

Persistent clones would also allow us to separate cloning (which requires connectivity) from other dvc operations (which don't). This would allow us to do the former in an environment (queue) with connectivity and the latter in environments without.

@dberenbaum
Collaborator

@johnyaku Have you considered keeping a clone on a shared space of the HPC so you can import from there instead of from the internet? Even if dvc had some support for caching clones, it would likely still need to check the internet to fetch updates from those clones. If you have your own clone of the repo, you can fully control when to update it and everyone can share that single repo copy (dvc will not make a new clone of a local repo).

@shcheklein added the p2-medium label and removed the p3-nice-to-have label on Aug 15, 2024
@johnyaku

johnyaku commented Aug 21, 2024

@dberenbaum I've been thinking along the same lines. We could maintain (and periodically update) repos on the local filesystem and specify the path to those repos instead of GitHub URLs. This would solve the no-internet access problem.

But we also want to maintain portability between platforms. (We work on two different HPCs, plus GCP.) So a URL that is accessible from any platform would be better from a portability perspective.
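
One possible way to keep a portable URL while still reading from a local mirror is git's `url.<base>.insteadOf` setting, which rewrites URL prefixes per machine. The sketch below builds the `git config` call and mimics the rewrite rule (longest matching prefix wins). The mirror path `/shared/mirrors/` and the function names are assumptions for illustration, and whether every git backend used by dvc honours `insteadOf` is untested here.

```python
def instead_of_command(local_base, remote_base):
    """Build the `git config` call that redirects remote_base to local_base."""
    return ["git", "config", "--global", f"url.{local_base}.insteadOf", remote_base]


def rewrite(url, rules):
    """Mimic git's rewrite: the longest matching insteadOf prefix wins.

    rules maps a remote URL prefix to its local replacement.
    """
    matches = [prefix for prefix in rules if url.startswith(prefix)]
    if not matches:
        return url  # no rule applies, URL is used as-is
    best = max(matches, key=len)
    return rules[best] + url[len(best):]


# Hypothetical per-platform rule: GitHub URLs resolve to a shared local mirror.
rules = {"https://github.com/": "/shared/mirrors/"}
print(rewrite("https://github.com/org/data.git", rules))
```

With a rule like this set only on the HPC, the same GitHub URL could stay in the tracked files on every platform.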

I can have calls to GitHub loop back to localhost via ~/.ssh/config, but then I'd need to change those settings back in order to update the local mirrors, which is potentially tedious (and error-prone if other processes are accessing ~/.ssh/config at the same time). If I could maintain two separate configs then I might be able to get it to work, but AFAIK the location of the SSH config is not configurable.
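
A possible way to run two SSH configs side by side, assuming the mirrors are updated with the plain git CLI: OpenSSH accepts an alternative per-user config through `ssh -F`, and git reads the `GIT_SSH_COMMAND` environment variable to decide how to invoke ssh. The sketch below only prepares the command and environment for one mirror update; the paths and the function name are hypothetical, and whether dvc itself respects `GIT_SSH_COMMAND` is not something this example establishes.

```python
import os
import shlex


def mirror_update(mirror_path, ssh_config, base_env=None):
    """Return (cmd, env) for updating one mirror with a non-default SSH config.

    ~/.ssh/config is left untouched; only this one git call uses ssh_config.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_SSH_COMMAND"] = f"ssh -F {shlex.quote(ssh_config)}"
    cmd = ["git", "-C", mirror_path, "remote", "update", "--prune"]
    return cmd, env


# Usage (hypothetical paths); pass both values to subprocess.run(cmd, env=env, check=True).
cmd, env = mirror_update("/shared/mirrors/org/data.git", "/home/me/.ssh/config.online", {})
```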

Happy to explore workarounds like this, or maybe dvc could keep the clones that it is already making?

@skshetry added the help wanted and p1-important labels and removed the p2-medium label on Aug 27, 2024
@skshetry pinned this issue on Aug 28, 2024
7 participants