get/import: --rev BRANCH: shallow clone #3585

casperdcl · 2020-04-02T15:30:03Z

dvc <get|import> REPO --rev BRANCH should clone with --single-branch --branch BRANCH rather than cloning the full repo.

We could also add --depth 1 to avoid getting the full commit history.

Alternatively we could do --branch BRANCH --no-single-branch --depth 1 to get all heads without history.

pass --rev BRANCH to git clone --branch BRANCH
- add --depth=1
  - expose --depth option?
fallback to full clone so that this doesn't break with other types of --rev
depends on import/update: cache git repos/clones #3496
- can fallback on --unshallow once we have a proper global cache
fixes 'dvc get' is cloning the full bitbucket repo when only one branch is needed #3473

…round)

casperdcl · 2020-04-02T15:57:19Z

It's probably best to keep --no-single-branch so that other HEADs are tracked in case checking out something else is required.

~~It looks like --rev has always been passed to git clone --branch. This means that we've never supported non-branch revs...~~

~~So I think it's best to have --no-single-branch --depth=1. This shouldn't break backward-compatibility. Supporting commit revs should be for a different issue.~~

EDIT: nvm; not true.

Basically we need to avoid:

dvc ((28d16f7...))$ dvc import https://github.com/iterative/dvc \
  README.rst --rev 28d16f7

ERROR: failed to import 'README.rst' from 'https://github.com/iterative/dvc'. - unknown Git revision '28d16f7'

Which is caused by passing 28d16f7 to git clone --branch.

We can't rely on remotes/servers configured to allow commit clones (https://stackoverflow.com/a/30701724/3896283)

shcheklein · 2020-04-02T22:29:02Z

@casperdcl there is a cache of external repos that we support (check CLONES in the external_repo.py). I think with this change we'll have to modify how this cache behave. It's not about modifying the clone operation only, unfortunately.

shcheklein · 2020-04-02T22:29:23Z

@Suor can give more context on this.

casperdcl · 2020-04-02T23:09:00Z

The way this is implemented (with the fallback on a full clone) I doubt anything should break (the tests are all passing). I'd need a unit test or at least a demo scenario where this would break.

Within a cached session, a shallow clone of a branch would have to be followed by a checkout of a SHA. That would break things. When would that ever realistically happen though?

casperdcl · 2020-04-02T23:16:49Z

recap after discussion with @shcheklein: untested behaviour is broken: dvc.api.open(rev=A); dvc.api.open(rev=B).

Need to sort out session (or even global #3496) repo caches first.

CC: @efiop @Suor

Suor · 2020-04-03T04:17:42Z

dvc/external_repo.py

@@ -248,7 +248,7 @@ def _clone_default_branch(url, rev):
        else:
            logger.debug("erepo: git clone %s to a temporary dir", url)
            clone_path = tempfile.mkdtemp("dvc-clone")
-            git = Git.clone(url, clone_path)
+            git = Git.clone(url, clone_path, rev=rev)


This breaks clones cache below. As it presumes it has default branch.

Suor · 2020-04-03T04:19:40Z

dvc/scm/git/__init__.py

+            if rev:
+                logger.debug("attempting shallow clone of branch %s", rev)
+                tmp_repo = clone_from(branch=rev, depth=1)
+            else:
+                logger.debug("full clone")
+                tmp_repo = clone_from()


This is very dangerous code, which will cause us grief. We shouldn't hide that we are sometimes make shallow code and sometimes not from outside. Outside may rely on clone being full as it currently does in external_repo.py.

I mean if we do want to make shallow or single branch clone we should request it in a call signature.

BTW, the retry below doesn't look good either, can we do it some other way?

we're going to have to fix "outside" (i.e. external_repo.py) first

Suor · 2020-04-03T07:38:46Z

Some explanations on how external repo cache works now:

we cache default branch clone, which is presumed to be full clone. This clone is updated with git pull each time we try to access anything but not a known sha (branch, tag, unknown sha)
we also cache checked out repos, these are copies of the default branch repo, so they are full too. They are cached by (url, sha) pair, so will never be updated. Full clone is not really needed.
we have write mode, in which copied and checked out repo is presumed to be modified, we do the same as above, but don't cache it, the copy is removed in the end
caches are in-process, this means that separate dvc commands, either sequential or parallel, won't use shared cache. This may cause issues for people having big repos.
we only clone each url once per process run. This was the requirement because git clone may ask for a password and we don't want people to be asked for the same password repeatedly. git pull, which we use to update an existing clone doesn't cause password prompt. Note, passwords are usually asked for https urls, git/ssh urls rely on keys.

Switching from full to shallow clones will cause some issues:

we can't simply clone for each pair of (url, rev) and cache it because people will be asked password repeatedly
if we make a shallow copy of default branch then we can't check out other branch or tag or sha as we do now
we may, however, try to git fetch whatever we want to check out to, make a copy, then checkout. There are some subconsiderations:
- need to check that git fetch won't ask password
- do not need to fetch a known sha (same optimization we do know for git pull)
- it doesn't matter whether we already know branch or tag, we still need git fetch since they might have moved

Making cache work between runs will also cause us some issues:

several processes may perform the same clone simultaneously, have no idea whether git handles this anyhow
several processes might be updating a default clone simultaneously. Git might already handle this, but need to check
one process may be making a copy of default clone, while other is updating it via git pull or git fetch. If we switch from simple copying to git clone local/path this might be already solved by git, but this needs to be checked
if git doesn't handle any of the above by its lock then we will need to protect the corresponding op with our lock

Bringing external repo cache outside a process is a prereq for #3496. This is not required to fix #3473 though, which is about shallow clones. So the parts above are independent.

Suor

See explanations on external repo cache above. I recommend separating this effort into two:

one about shallow clones
one about inter-process/between runs cache

casperdcl · 2020-04-03T14:23:58Z

@Suor (CC @shcheklein @efiop)

I think we're all on the same page; there's only one thing I disagree with (right at the end).

we cache default branch clone, which is presumed to be full clone.

This needs to be changed to remove this presumption (i.e. change to fetch missing things)

This clone is updated with git pull each time we try to access anything but not a known sha (branch, tag, unknown sha)

Would need to fetch --all and potentially --unshallow if still not found.

we also cache checked out repos, these are copies of the default branch repo, so they are full too. They are cached by (url, sha) pair, so will never be updated. Full clone is not really needed.

You mean we have two copies of each repo? Why? Sounds like my response above should apply to this "working copy" repo instead.

Switching from full to shallow clones will cause some issues:

we can't simply clone for each pair of (url, rev) and cache it because people will be asked password repeatedly

I don't understand this at all. We'd clone (or pull if already in persistent cache) once for each url, and checkout for each rev.

if we make a shallow copy of default branch then we can't check out other branch or tag or sha as we do now

we may, however, try to git fetch whatever we want to check out to, make a copy, then checkout.

Currently we still really have to fetch/pull (e.g. to make sure when checking out a different branch that the branch is up-to-date). Maybe we've skipped that by assuming the repo was fully cloned in the same session and unlikely to have been updated since the last job within the session. Moving to a persistent cache, though, we naturally cannot assume checkout will work. I propose to fallback to fetch --all and then --unshallow when trying to checkout.

need to check that git fetch won't ask password

why? What's wrong with prompting if the user uses https without auth tokens? Git itself would prompt for every single operation in such scenarios. Git also has its own mechanism for a credential cache so we don't need to implement our own.

several processes may perform the same clone simultaneously, have no idea whether git handles this anyhow

the first process to create the target (cached) dir will succeed; the rest will fail without doing anything since the dir already exists. If the filesystem isn't atomic then I presume a process could needlessly perform an extra clone which gets immediately deleted

several processes might be updating a default clone simultaneously. Git might already handle this, but need to check

Yes, it should.

one process may be making a copy of default clone, while other is updating it via git pull or git fetch. If we switch from simple copying to git clone local/path this might be already solved by git, but this needs to be checked

Ah. starting to understand - so currently we create a full clone mirror and then copy from that mirror? Yeah that needs to change to:

not assume the mirror is "full" (fallback to fetch --all and then --unshallow). Ideally also make the mirror "bare".
clone - rather than copy - from the mirror to a working copy (ideally both shallow and sparse).

if git doesn't handle any of the above by its lock then we will need to protect the corresponding op with our lock

Yes; fun times.

Bringing external repo cache outside a process is a prereq for #3496.

Some misunderstanding here. I think it is #3496, not a separate prereq.

This is not required to fix #3473 though, which is about shallow clones. So the parts above are independent.

Fully disagree with this. As discussed above, my proposal would involve necessarily persisting repos first; then removing assumptions about their fullness/unshallowness, and then as an afterthought considering sparsity of working copies.

Quick recap of related issues (think we should create a new "super" issue or a mini-project to keep track of them?)

import/update: cache git repos/clones #3496 persistent repo cache
get/import: --rev BRANCH: shallow clone #3585 <- 'dvc get' is cloning the full bitbucket repo when only one branch is needed #3473 shallow clone
Restyle WIP: Use pre-commit for Git hooks #3448 sparse checkout

I think they have to be dealt with in this specific order.

efiop · 2020-04-22T14:04:32Z

For the record: agreed to freeze this for now, I'll come back to this after I utilize DvcTree in external_repo stuff. Should give a fresh perspective on this.

efiop · 2020-06-09T13:48:37Z

@pmrowla Could you take a look at this, please? It would be great to hear your opinion about it.

pmrowla · 2020-06-10T05:26:05Z

For erepos, in most cases we now generate a GitTree for the needed revision (using resolve_rev()) instead of doing the git checkout into the clone directory. (dvcx for_write=True being the exception where we still need a full clone + checkout)

So other than the dvcx use case, we don't really need full clones at all now. I think we should be able to use a single shallow clone (git clone --depth=1 --no-single-branch) ~~and git fetch --depth=1 for being able to resolve additional specific revisions that we need~~. For branches we will need to be able to fetch/fast-forward to the latest revision so that HEAD points to the right place, but that is still doable with shallow clones.

I'm not sure how shallow clones + fetch work regarding the potential password prompt on fetch issue though.

edit: for resolving additional revisions we will need to do fetch --unshallow as mentioned in the earlier discussions

pmrowla · 2020-07-14T13:51:22Z

You mean we have two copies of each repo? Why? Sounds like my response above should apply to this "working copy" repo instead.

Just to clarify, this is no longer the case after the tree related erepo changes. We still have the single main full clone, but rather than separate copies for checkouts of other revisions, we now create GitTree's directly from the one initial clone (with dvcx for_write as the exception).

I don't have the bandwidth to finish this right now, but regarding shallow clones:

We can clone with --no-single-branch --depth=1 --branch=<branch> if the user is running dvc get/import --rev=<branch> where <branch> is an actual branch or a tag.
This should probably be handled by exposing depth and branch as kwargs in:

dvc/dvc/scm/git/__init__.py

Line 96 in c1b04b8

def clone(url, to_path, rev=None):

and then passing the appropriate params when we create the erepo clone:

dvc/dvc/external_repo.py

Line 272 in c1b04b8

def _clone_default_branch(url, rev):
If is a revision sha, we still need to do the existing full clone behavior
We also need to keep the existing standalone full clone for dvcx for_write

For shallow clones, if we need to access a different rev, we will need to do fetch --unshallow where we currently do

dvc/dvc/external_repo.py

Lines 284 to 286 in c1b04b8

    
           if not Git.is_sha(rev) or not git.has_rev(rev): 
        
               logger.debug("erepo: git pull %s", url) 
        
               git.pull()

Other than that, I don't think shallow clones should affect any of the tree related erepo behavior. Also, as far as I can tell from testing on my end, using shallow clones should not affect the current git password related behavior either.

pmrowla · 2020-07-20T07:49:52Z

closing in favor of #4246.

external repo: clone single specified branch with depth 1 (3473 worka…

Verified

This commit was signed with the committer’s verified signature.

casperdcl Casper da Costa-Luis

GPG key ID: D88C553DBD362CDE

Learn about vigilant mode

28d16f7

…round)

casperdcl added bug p2-medium performance research discussion labels Apr 2, 2020

casperdcl requested review from Suor, shcheklein and efiop April 2, 2020 15:30

casperdcl self-assigned this Apr 2, 2020

git: clone: back to --no-single-branch

Verified

This commit was signed with the committer’s verified signature.

casperdcl Casper da Costa-Luis

GPG key ID: D88C553DBD362CDE

Learn about vigilant mode

fd66d76

casperdcl added 4 commits April 2, 2020 17:22

git: clone: try partial, fallback full

Verified

This commit was signed with the committer’s verified signature.

casperdcl Casper da Costa-Luis

GPG key ID: D88C553DBD362CDE

Learn about vigilant mode

cec5b58

git: clone: minor bugfix

Verified

This commit was signed with the committer’s verified signature.

casperdcl Casper da Costa-Luis

GPG key ID: D88C553DBD362CDE

Learn about vigilant mode

a3cb0cd

git: clone: shallow only if rev specified

Verified

This commit was signed with the committer’s verified signature.

casperdcl Casper da Costa-Luis

GPG key ID: D88C553DBD362CDE

Learn about vigilant mode

e287dc9

fix py3.5

Verified

This commit was signed with the committer’s verified signature.

casperdcl Casper da Costa-Luis

GPG key ID: D88C553DBD362CDE

Learn about vigilant mode

d2666e8

casperdcl force-pushed the git-branch branch from 51a512f to d2666e8 Compare April 2, 2020 18:00

casperdcl marked this pull request as ready for review April 2, 2020 18:28

casperdcl removed their assignment Apr 2, 2020

Suor suggested changes Apr 3, 2020

View reviewed changes

weekly-digest bot mentioned this pull request Apr 26, 2020

Weekly Digest (19 April, 2020 - 26 April, 2020) #3683

Closed

efiop assigned pmrowla Jun 16, 2020

casperdcl changed the title ~~get/import: clone using --rev~~ get/import: shallow clone Jul 14, 2020

casperdcl changed the title ~~get/import: shallow clone~~ get/import: --rev BRANCH: shallow clone Jul 14, 2020

weekly-digest bot mentioned this pull request Jul 19, 2020

Weekly Digest (12 July, 2020 - 19 July, 2020) #4238

Closed

pmrowla mentioned this pull request Jul 20, 2020

erepo: --rev BRANCH: shallow clone #4246

Merged

3 tasks

pmrowla closed this Jul 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get/import: --rev BRANCH: shallow clone #3585

get/import: --rev BRANCH: shallow clone #3585

casperdcl commented Apr 2, 2020 •

edited

Loading

casperdcl commented Apr 2, 2020 •

edited

Loading

shcheklein commented Apr 2, 2020

shcheklein commented Apr 2, 2020

casperdcl commented Apr 2, 2020

casperdcl commented Apr 2, 2020 •

edited

Loading

Suor Apr 3, 2020

Suor Apr 3, 2020 •

edited

Loading

casperdcl Apr 3, 2020

Suor commented Apr 3, 2020

Suor left a comment

casperdcl commented Apr 3, 2020

efiop commented Apr 22, 2020

efiop commented Jun 9, 2020

pmrowla commented Jun 10, 2020 •

edited

Loading

pmrowla commented Jul 14, 2020

pmrowla commented Jul 20, 2020

get/import: --rev BRANCH: shallow clone #3585

get/import: --rev BRANCH: shallow clone #3585

Conversation

casperdcl commented Apr 2, 2020 • edited Loading

casperdcl commented Apr 2, 2020 • edited Loading

shcheklein commented Apr 2, 2020

shcheklein commented Apr 2, 2020

casperdcl commented Apr 2, 2020

casperdcl commented Apr 2, 2020 • edited Loading

Suor Apr 3, 2020

Choose a reason for hiding this comment

Suor Apr 3, 2020 • edited Loading

Choose a reason for hiding this comment

casperdcl Apr 3, 2020

Choose a reason for hiding this comment

Suor commented Apr 3, 2020

Suor left a comment

Choose a reason for hiding this comment

casperdcl commented Apr 3, 2020

efiop commented Apr 22, 2020

efiop commented Jun 9, 2020

pmrowla commented Jun 10, 2020 • edited Loading

pmrowla commented Jul 14, 2020

pmrowla commented Jul 20, 2020

casperdcl commented Apr 2, 2020 •

edited

Loading

casperdcl commented Apr 2, 2020 •

edited

Loading

casperdcl commented Apr 2, 2020 •

edited

Loading

Suor Apr 3, 2020 •

edited

Loading

pmrowla commented Jun 10, 2020 •

edited

Loading