Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get/import: --rev BRANCH: shallow clone #3585

Closed
wants to merge 6 commits into from

Conversation

casperdcl
Copy link
Contributor

@casperdcl casperdcl commented Apr 2, 2020

dvc <get|import> REPO --rev BRANCH should clone with --single-branch --branch BRANCH rather than cloning the full repo.

We could also add --depth 1 to avoid getting the full commit history.

Alternatively we could do --branch BRANCH --no-single-branch --depth 1 to get all heads without history.

Sorry, something went wrong.

Verified

This commit was signed with the committer’s verified signature.
casperdcl Casper da Costa-Luis
…round)
@casperdcl casperdcl added bug Did we break something? p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks research discussion requires active participation to reach a conclusion labels Apr 2, 2020
@casperdcl casperdcl requested review from Suor, shcheklein and efiop April 2, 2020 15:30
@casperdcl casperdcl self-assigned this Apr 2, 2020

Verified

This commit was signed with the committer’s verified signature.
casperdcl Casper da Costa-Luis
@casperdcl
Copy link
Contributor Author

casperdcl commented Apr 2, 2020

It's probably best to keep --no-single-branch so that other HEADs are tracked in case checking out something else is required.

It looks like --rev has always been passed to git clone --branch. This means that we've never supported non-branch revs...

So I think it's best to have --no-single-branch --depth=1. This shouldn't break backward-compatibility. Supporting commit revs should be for a different issue.

EDIT: nvm; not true.

Basically we need to avoid:

dvc ((28d16f7...))$ dvc import https://github.com/iterative/dvc \
  README.rst --rev 28d16f7

ERROR: failed to import 'README.rst' from 'https://github.com/iterative/dvc'. - unknown Git revision '28d16f7'

Which is caused by passing 28d16f7 to git clone --branch.

We can't rely on remotes/servers configured to allow commit clones (https://stackoverflow.com/a/30701724/3896283)

Verified

This commit was signed with the committer’s verified signature.
casperdcl Casper da Costa-Luis

Verified

This commit was signed with the committer’s verified signature.
casperdcl Casper da Costa-Luis

Verified

This commit was signed with the committer’s verified signature.
casperdcl Casper da Costa-Luis

Verified

This commit was signed with the committer’s verified signature.
casperdcl Casper da Costa-Luis
@casperdcl casperdcl marked this pull request as ready for review April 2, 2020 18:28
@shcheklein
Copy link
Member

@casperdcl there is a cache of external repos that we support (check CLONES in the external_repo.py). I think with this change we'll have to modify how this cache behave. It's not about modifying the clone operation only, unfortunately.

@shcheklein
Copy link
Member

@Suor can give more context on this.

@casperdcl
Copy link
Contributor Author

The way this is implemented (with the fallback on a full clone) I doubt anything should break (the tests are all passing). I'd need a unit test or at least a demo scenario where this would break.

Within a cached session, a shallow clone of a branch would have to be followed by a checkout of a SHA. That would break things. When would that ever realistically happen though?

@casperdcl
Copy link
Contributor Author

casperdcl commented Apr 2, 2020

recap after discussion with @shcheklein: untested behaviour is broken: dvc.api.open(rev=A); dvc.api.open(rev=B).

Need to sort out session (or even global #3496) repo caches first.

CC: @efiop @Suor

@casperdcl casperdcl removed their assignment Apr 2, 2020
@@ -248,7 +248,7 @@ def _clone_default_branch(url, rev):
else:
logger.debug("erepo: git clone %s to a temporary dir", url)
clone_path = tempfile.mkdtemp("dvc-clone")
git = Git.clone(url, clone_path)
git = Git.clone(url, clone_path, rev=rev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks clones cache below. As it presumes it has default branch.

Comment on lines +84 to +89
if rev:
logger.debug("attempting shallow clone of branch %s", rev)
tmp_repo = clone_from(branch=rev, depth=1)
else:
logger.debug("full clone")
tmp_repo = clone_from()
Copy link
Contributor

@Suor Suor Apr 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very dangerous code, which will cause us grief. We shouldn't hide that we are sometimes make shallow code and sometimes not from outside. Outside may rely on clone being full as it currently does in external_repo.py.

I mean if we do want to make shallow or single branch clone we should request it in a call signature.

BTW, the retry below doesn't look good either, can we do it some other way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're going to have to fix "outside" (i.e. external_repo.py) first

@Suor
Copy link
Contributor

Suor commented Apr 3, 2020

Some explanations on how external repo cache works now:

  • we cache default branch clone, which is presumed to be full clone. This clone is updated with git pull each time we try to access anything but not a known sha (branch, tag, unknown sha)
  • we also cache checked out repos, these are copies of the default branch repo, so they are full too. They are cached by (url, sha) pair, so will never be updated. Full clone is not really needed.
  • we have write mode, in which copied and checked out repo is presumed to be modified, we do the same as above, but don't cache it, the copy is removed in the end
  • caches are in-process, this means that separate dvc commands, either sequential or parallel, won't use shared cache. This may cause issues for people having big repos.
  • we only clone each url once per process run. This was the requirement because git clone may ask for a password and we don't want people to be asked for the same password repeatedly. git pull, which we use to update an existing clone doesn't cause password prompt. Note, passwords are usually asked for https urls, git/ssh urls rely on keys.

Switching from full to shallow clones will cause some issues:

  • we can't simply clone for each pair of (url, rev) and cache it because people will be asked password repeatedly
  • if we make a shallow copy of default branch then we can't check out other branch or tag or sha as we do now
  • we may, however, try to git fetch whatever we want to check out to, make a copy, then checkout. There are some subconsiderations:
    • need to check that git fetch won't ask password
    • do not need to fetch a known sha (same optimization we do know for git pull)
    • it doesn't matter whether we already know branch or tag, we still need git fetch since they might have moved

Making cache work between runs will also cause us some issues:

  • several processes may perform the same clone simultaneously, have no idea whether git handles this anyhow
  • several processes might be updating a default clone simultaneously. Git might already handle this, but need to check
  • one process may be making a copy of default clone, while other is updating it via git pull or git fetch. If we switch from simple copying to git clone local/path this might be already solved by git, but this needs to be checked
  • if git doesn't handle any of the above by its lock then we will need to protect the corresponding op with our lock

Bringing external repo cache outside a process is a prereq for #3496. This is not required to fix #3473 though, which is about shallow clones. So the parts above are independent.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See explanations on external repo cache above. I recommend separating this effort into two:

  • one about shallow clones
  • one about inter-process/between runs cache

@casperdcl
Copy link
Contributor Author

@Suor (CC @shcheklein @efiop)

I think we're all on the same page; there's only one thing I disagree with (right at the end).

  • we cache default branch clone, which is presumed to be full clone.

This needs to be changed to remove this presumption (i.e. change to fetch missing things)

This clone is updated with git pull each time we try to access anything but not a known sha (branch, tag, unknown sha)

Would need to fetch --all and potentially --unshallow if still not found.

  • we also cache checked out repos, these are copies of the default branch repo, so they are full too. They are cached by (url, sha) pair, so will never be updated. Full clone is not really needed.

You mean we have two copies of each repo? Why? Sounds like my response above should apply to this "working copy" repo instead.

Switching from full to shallow clones will cause some issues:

  • we can't simply clone for each pair of (url, rev) and cache it because people will be asked password repeatedly

I don't understand this at all. We'd clone (or pull if already in persistent cache) once for each url, and checkout for each rev.

  • if we make a shallow copy of default branch then we can't check out other branch or tag or sha as we do now
  • we may, however, try to git fetch whatever we want to check out to, make a copy, then checkout.

Currently we still really have to fetch/pull (e.g. to make sure when checking out a different branch that the branch is up-to-date). Maybe we've skipped that by assuming the repo was fully cloned in the same session and unlikely to have been updated since the last job within the session. Moving to a persistent cache, though, we naturally cannot assume checkout will work. I propose to fallback to fetch --all and then --unshallow when trying to checkout.

  • need to check that git fetch won't ask password

why? What's wrong with prompting if the user uses https without auth tokens? Git itself would prompt for every single operation in such scenarios. Git also has its own mechanism for a credential cache so we don't need to implement our own.

  • several processes may perform the same clone simultaneously, have no idea whether git handles this anyhow

the first process to create the target (cached) dir will succeed; the rest will fail without doing anything since the dir already exists. If the filesystem isn't atomic then I presume a process could needlessly perform an extra clone which gets immediately deleted

  • several processes might be updating a default clone simultaneously. Git might already handle this, but need to check

Yes, it should.

  • one process may be making a copy of default clone, while other is updating it via git pull or git fetch. If we switch from simple copying to git clone local/path this might be already solved by git, but this needs to be checked

Ah. starting to understand - so currently we create a full clone mirror and then copy from that mirror? Yeah that needs to change to:

  1. not assume the mirror is "full" (fallback to fetch --all and then --unshallow). Ideally also make the mirror "bare".
  2. clone - rather than copy - from the mirror to a working copy (ideally both shallow and sparse).
  • if git doesn't handle any of the above by its lock then we will need to protect the corresponding op with our lock

Yes; fun times.

Bringing external repo cache outside a process is a prereq for #3496.

Some misunderstanding here. I think it is #3496, not a separate prereq.

This is not required to fix #3473 though, which is about shallow clones. So the parts above are independent.

Fully disagree with this. As discussed above, my proposal would involve necessarily persisting repos first; then removing assumptions about their fullness/unshallowness, and then as an afterthought considering sparsity of working copies.

Quick recap of related issues (think we should create a new "super" issue or a mini-project to keep track of them?)

I think they have to be dealt with in this specific order.

@efiop
Copy link
Contributor

efiop commented Apr 22, 2020

For the record: agreed to freeze this for now, I'll come back to this after I utilize DvcTree in external_repo stuff. Should give a fresh perspective on this.

@efiop
Copy link
Contributor

efiop commented Jun 9, 2020

@pmrowla Could you take a look at this, please? It would be great to hear your opinion about it.

@pmrowla
Copy link
Contributor

pmrowla commented Jun 10, 2020

For erepos, in most cases we now generate a GitTree for the needed revision (using resolve_rev()) instead of doing the git checkout into the clone directory. (dvcx for_write=True being the exception where we still need a full clone + checkout)

So other than the dvcx use case, we don't really need full clones at all now. I think we should be able to use a single shallow clone (git clone --depth=1 --no-single-branch) and git fetch --depth=1 for being able to resolve additional specific revisions that we need. For branches we will need to be able to fetch/fast-forward to the latest revision so that HEAD points to the right place, but that is still doable with shallow clones.

I'm not sure how shallow clones + fetch work regarding the potential password prompt on fetch issue though.

edit: for resolving additional revisions we will need to do fetch --unshallow as mentioned in the earlier discussions

@pmrowla
Copy link
Contributor

pmrowla commented Jul 14, 2020

You mean we have two copies of each repo? Why? Sounds like my response above should apply to this "working copy" repo instead.

Just to clarify, this is no longer the case after the tree related erepo changes. We still have the single main full clone, but rather than separate copies for checkouts of other revisions, we now create GitTree's directly from the one initial clone (with dvcx for_write as the exception).

I don't have the bandwidth to finish this right now, but regarding shallow clones:

  • We can clone with --no-single-branch --depth=1 --branch=<branch> if the user is running dvc get/import --rev=<branch> where <branch> is an actual branch or a tag.
    This should probably be handled by exposing depth and branch as kwargs in:

    def clone(url, to_path, rev=None):

    and then passing the appropriate params when we create the erepo clone:
    def _clone_default_branch(url, rev):

  • If is a revision sha, we still need to do the existing full clone behavior

  • We also need to keep the existing standalone full clone for dvcx for_write

  • For shallow clones, if we need to access a different rev, we will need to do fetch --unshallow where we currently do

    dvc/dvc/external_repo.py

    Lines 284 to 286 in c1b04b8

    if not Git.is_sha(rev) or not git.has_rev(rev):
    logger.debug("erepo: git pull %s", url)
    git.pull()

Other than that, I don't think shallow clones should affect any of the tree related erepo behavior. Also, as far as I can tell from testing on my end, using shallow clones should not affect the current git password related behavior either.

@casperdcl casperdcl changed the title get/import: clone using --rev get/import: shallow clone Jul 14, 2020
@casperdcl casperdcl changed the title get/import: shallow clone get/import: --rev BRANCH: shallow clone Jul 14, 2020
@pmrowla pmrowla mentioned this pull request Jul 20, 2020
3 tasks
@pmrowla
Copy link
Contributor

pmrowla commented Jul 20, 2020

closing in favor of #4246.

@pmrowla pmrowla closed this Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? discussion requires active participation to reach a conclusion p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks research
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'dvc get' is cloning the full bitbucket repo when only one branch is needed
5 participants