get/import: --rev BRANCH: shallow clone #3585
It's probably best to keep
EDIT: nvm; not true. Basically we need to avoid:
dvc ((28d16f7...))$ dvc import https://github.com/iterative/dvc \
README.rst --rev 28d16f7
ERROR: failed to import 'README.rst' from 'https://github.com/iterative/dvc'. - unknown Git revision '28d16f7'
Which is caused by passing
We can't rely on remotes/servers configured to allow commit clones (https://stackoverflow.com/a/30701724/3896283)
@casperdcl there is a cache of external repos that we support (check
@Suor can give more context on this.
The way this is implemented (with the fallback on a full clone) I doubt anything should break (the tests are all passing). I'd need a unit test or at least a demo scenario where this would break. Within a cached session, a shallow clone of a branch would have to be followed by a checkout of a SHA. That would break things. When would that ever realistically happen though?
recap after discussion with @shcheklein: untested behaviour is broken: Need to sort out session (or even global #3496) repo caches first.
@@ -248,7 +248,7 @@ def _clone_default_branch(url, rev):
     else:
         logger.debug("erepo: git clone %s to a temporary dir", url)
         clone_path = tempfile.mkdtemp("dvc-clone")
-        git = Git.clone(url, clone_path)
+        git = Git.clone(url, clone_path, rev=rev)
This breaks the clones cache below, as it presumes it has the default branch.
if rev:
    logger.debug("attempting shallow clone of branch %s", rev)
    tmp_repo = clone_from(branch=rev, depth=1)
else:
    logger.debug("full clone")
    tmp_repo = clone_from()
This is very dangerous code, which will cause us grief. We shouldn't hide from the outside that we sometimes make a shallow clone and sometimes not. Outside code may rely on the clone being full, as it currently does in external_repo.py.
I mean, if we do want to make a shallow or single-branch clone, we should request it in the call signature.
BTW, the retry below doesn't look good either; can we do it some other way?
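To illustrate what "request it in the call signature" could look like, here is a minimal sketch, not the DVC implementation. The helper name `git_clone_argv` and its parameters are hypothetical; it only builds the `git clone` command line, so the caller decides explicitly whether the clone is shallow or single-branch, and nothing is chosen behind its back.

```python
from typing import List, Optional


def git_clone_argv(
    url: str,
    path: str,
    branch: Optional[str] = None,
    depth: Optional[int] = None,
    single_branch: bool = False,
) -> List[str]:
    """Build a `git clone` command line (hypothetical helper, for illustration).

    A full clone is the default; shallow or single-branch clones happen only
    when the caller asks for them through `depth` / `single_branch`.
    """
    argv = ["git", "clone"]
    if branch is not None:
        argv += ["--branch", branch]
    if depth is not None:
        if depth < 1:
            raise ValueError("depth must be a positive integer")
        argv += ["--depth", str(depth)]
    # `--depth` implies `--single-branch` in git, so always spell out the choice.
    argv.append("--single-branch" if single_branch else "--no-single-branch")
    # `--` ends option parsing, so a URL starting with "-" is not read as a flag.
    argv += ["--", url, path]
    return argv
```

A caller that needs a full clone, as external_repo.py currently assumes, would simply leave `depth` unset; a caller that knows a branch tip is enough would pass `branch=rev, depth=1, single_branch=True`.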
we're going to have to fix "outside" (i.e. external_repo.py) first
Some explanations on how external repo cache works now:
Switching from full to shallow clones will cause some issues:
Making cache work between runs will also cause us some issues:
Bringing external repo cache outside a process is a prereq for #3496. This is not required to fix #3473 though, which is about shallow clones. So the parts above are independent.
See explanations on external repo cache above. I recommend separating this effort into two:
- one about shallow clones
- one about inter-process/between runs cache
@Suor (CC @shcheklein @efiop) I think we're all on the same page; there's only one thing I disagree with (right at the end).
This needs to be changed to remove this presumption (i.e. change to fetch missing things)
Would need to
You mean we have two copies of each repo? Why? Sounds like my response above should apply to this "working copy" repo instead.
I don't understand this at all. We'd clone (or pull if already in persistent cache) once for each
Currently we still really have to fetch/pull (e.g. to make sure when checking out a different branch that the branch is up-to-date). Maybe we've skipped that by assuming the repo was fully cloned in the same session and unlikely to have been updated since the last job within the session. Moving to a persistent cache, though, we naturally cannot assume checkout will work. I propose to fallback to
why? What's wrong with prompting if the user uses https without auth tokens? Git itself would prompt for every single operation in such scenarios. Git also has its own mechanism for a credential cache so we don't need to implement our own.
the first process to create the target (cached) dir will succeed; the rest will fail without doing anything since the dir already exists. If the filesystem isn't atomic then I presume a process could needlessly perform an extra clone which gets immediately deleted
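As a rough sketch of the race described here (not DVC code): each process clones into a private temporary directory and then renames it into place. The function name `populate_cache_dir` and the `fill` callback are hypothetical. This is the clone-then-rename variant, so a process that loses the race performs one needless clone and deletes it immediately. It assumes the temporary directory and the target sit on the same filesystem, so that the rename is a single step.

```python
import os
import shutil
import tempfile
from typing import Callable


def populate_cache_dir(target: str, fill: Callable[[str], None]) -> bool:
    """Create `target` once; return True only for the call that created it.

    `fill(path)` is a hypothetical callback that writes the clone into `path`.
    """
    if os.path.isdir(target):
        return False  # already cached, nothing to do
    parent = os.path.dirname(os.path.abspath(target))
    os.makedirs(parent, exist_ok=True)
    # Work in a private directory next to the target (same filesystem).
    tmp = tempfile.mkdtemp(prefix=".tmp-clone-", dir=parent)
    try:
        fill(tmp)
        try:
            os.rename(tmp, target)
        except OSError:
            if os.path.isdir(target):
                return False  # another process won the race
            raise
        return True
    finally:
        # After a successful rename `tmp` is gone; otherwise discard the extra clone.
        if os.path.isdir(tmp):
            shutil.rmtree(tmp, ignore_errors=True)
```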
Yes, it should.
Ah. starting to understand - so currently we create a full clone mirror and then copy from that mirror? Yeah that needs to change to:
Yes; fun times.
Some misunderstanding here. I think it is #3496, not a separate prereq.
Fully disagree with this. As discussed above, my proposal would involve necessarily persisting repos first; then removing assumptions about their fullness/unshallowness, and then as an afterthought considering sparsity of working copies. Quick recap of related issues (think we should create a new "super" issue or a mini-project to keep track of them?)
I think they have to be dealt with in this specific order.
For the record: agreed to freeze this for now, I'll come back to this after I utilize DvcTree in external_repo stuff. Should give a fresh perspective on this.
@pmrowla Could you take a look at this, please? It would be great to hear your opinion about it.
For erepos, in most cases we now generate a So other than the dvcx use case, we don't really need full clones at all now. I think we should be able to use a single shallow clone ( I'm not sure how shallow clones + fetch work regarding the potential password prompt on fetch issue though. edit: for resolving additional revisions we will need to do
Just to clarify, this is no longer the case after the tree related erepo changes. We still have the single main full clone, but rather than separate copies for checkouts of other revisions, we now create GitTree's directly from the one initial clone (with dvcx I don't have the bandwidth to finish this right now, but regarding shallow clones:
Other than that, I don't think shallow clones should affect any of the tree related erepo behavior. Also, as far as I can tell from testing on my end, using shallow clones should not affect the current git password related behavior either.
closing in favor of #4246.
dvc <get|import> REPO --rev BRANCH should clone with --single-branch --branch BRANCH rather than cloning the full repo. We could also add --depth 1 to avoid getting the full commit history. Alternatively we could do --branch BRANCH --no-single-branch --depth 1 to get all heads without history.
--rev BRANCH to git clone --branch BRANCH
--depth=1
--depth option?
--rev
--unshallow once we have a proper global cache
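For the `--unshallow` item, a minimal sketch of how a cached clone could be deepened before checking out an arbitrary revision. The helper names `is_shallow` and `fetch_argv` are hypothetical, and the sketch assumes a regular non-bare clone whose `.git` is a directory and whose remote is named `origin`. Git records shallow boundaries in `$GIT_DIR/shallow`, and `git fetch --unshallow` is rejected on a complete repository, hence the check.

```python
import os
from typing import List


def is_shallow(repo_path: str) -> bool:
    """Return True if the clone at `repo_path` is shallow.

    Assumes a non-bare repository with a `.git` directory.
    """
    return os.path.isfile(os.path.join(repo_path, ".git", "shallow"))


def fetch_argv(repo_path: str) -> List[str]:
    """Build a `git fetch` command that also deepens a shallow clone."""
    argv = ["git", "-C", repo_path, "fetch"]
    if is_shallow(repo_path):
        # Only valid on shallow repositories; converts them to complete ones.
        argv.append("--unshallow")
    argv.append("origin")  # assumed remote name
    return argv
```

Note that a clone made with `--single-branch` only tracks that one branch, so unshallowing it still fetches history for that branch alone unless the fetch refspec is widened as well.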