Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

git clone vs downloading zip #180

Closed
elorest opened this issue Oct 30, 2017 · 26 comments
Closed

git clone vs downloading zip #180

elorest opened this issue Oct 30, 2017 · 26 comments
Milestone

Comments

@elorest
Copy link

elorest commented Oct 30, 2017

It seems that shards does a clone of the project vs just downloading the the relevant code for tag or branch. Wouldn't it be more efficient for both us and github to just download the required shard as a zip or tar.gz? curl -OL https://github.com/amberframework/amber/archive/v0.3.0.tar.gz

There were issues related to this with CocoaPods using github in this fashion a while back.
https://news.ycombinator.com/item?id=11245652

@Sija
Copy link
Contributor

Sija commented Oct 30, 2017

Wouldn't it be more efficient for both us and github to just download the required shard as a zip or tar.gz? curl -OL https://github.com/amberframework/amber/archive/v0.3.0.tar.gz

according to CocoaPods/CocoaPods#4989 (comment) and CocoaPods/CocoaPods#4989 (comment) doing git clone (without --shallow) is pretty efficient, at least in terms of CPU.

There were issues related to this with CocoaPods using github in this fashion a while back.
https://news.ycombinator.com/item?id=11245652

FYI, mentioned issue has been resolved by removing --shallow from git clone performed by CocoaPods client along with preflight checks for updates and partitioning 16k directory (see CocoaPods/CocoaPods#4989 (comment)).

@elorest
Copy link
Author

elorest commented Oct 30, 2017

Thanks for the response. If the solution for github load is to remove --shallow, then we end up having to download a lot of data for large or old repos with lots of history. This can be a problem for people with poor internet connections. Is cloning the repo without --shallow more efficient than downloading branch/tag archives for github? Because it definitely takes longer for the client.

@RX14
Copy link
Contributor

RX14 commented Oct 30, 2017

Cocoapods was special because it cloned a single repository every single time. Shards doesn't do that, every single shard is in its own repository.

We'd need to put in special logic to handle github if we wanted to do this, and currently we don't have that. We rely on a local git heck out and actually perform full git operations on the repo such as listing tags and checking out multiple tags. For this reason we need a full clone unless we also use the github api to replace this. So this is neither simple or neccesary I think.

@ysbaddaden
Copy link
Contributor

We could specialise the GitHub resolver, but we should verify that it's actually faster. Having to request the Git server for tags, for each shard.yml per Git refs, ... may end up being slower than cloning the repository. Not even talking of "duplicating" implementation logic to accommodate a specific Git service.

If I remember correctly, Bower doesnt specialise for GitHub but checks for a "smart" Git server and clones at --depth 1 if possible —then I don't know how it accesses tags, etc.

Maybe install could download a zip or a Git archive for the specified refs, to speed up repeated installs (e.g. on CI), but update will probably continue to do a full clone.

@ysbaddaden
Copy link
Contributor

Looking deeper into the CocoaPods issue, there are a couple things we could do to improve the situation:

  1. Have a per-user cache for cloned repositories, instead of a per-project (e.g. ~/.cache/shards);
  2. Use https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/ to avoid long and slow git fetches;
  3. Have an install --download command to download tarballs instead of cloning (may avoid some lockfile checks, but that should ok);

@RX14
Copy link
Contributor

RX14 commented Oct 30, 2017

I don't think we should overcomplicate shards with too much optimizations. I'd go for a central repository index first in $XDG_CACHE_HOME/shards (which we should only have to update when we do shards update, since otherwise we have a sha or tag in the cache (or a cache miss)).

(2) is still pretty easy, just a simple API call, but it is github-specific. So i'd leave that off for now. (3) I don't think we should ever implement unless github communicates with us that we're causing excessive load.

@ysbaddaden
Copy link
Contributor

Point 2. is a mere method overload to call super or not. I expect the HTTP call to be faster than a Git fetch. Let's be honest: 99% of shards are hosted on GitHub and will benefit from it, so we should implement it.

@RX14
Copy link
Contributor

RX14 commented Oct 30, 2017

Surely you need to match the git url to github?

@ysbaddaden
Copy link
Contributor

@RX14
Copy link
Contributor

RX14 commented Oct 30, 2017

Seems slightly unexpected that it'd work for the github: user/repo syntax and not the git syntax. In my eyes the github syntax would simply be an alias and not have any different behaviour.

@marksiemers
Copy link

marksiemers commented Oct 30, 2017

I wanted to add an example of where this would be extremely helpful - mostly if any shard creator or maintainer accidentally forgets to gitignore something early in the project, then the git clone is stuck with that forever (unless git history is re-written or deleted).

One example shard, doing a git clone was downloading 30+ MB
The same shard, if just one release were downloaded it is between 2KB - 4KB

Even with shard caching, if a test suite involves doing a shards build - I don't think there will ever be a cache hit.

EDIT : I realized that with a global cache, there will be cache hits. With containers and when running tests on travis, those containers may have an empty global cache to start.

@ysbaddaden
Copy link
Contributor

Global cache is now available on the master branch, and should be alongside crystal's cache. You may customize it with SHARDS_CACHE_PATH, for example you may return to the previous behavior by exporting SHARDS_CACHE_PATH=.shards

@elorest
Copy link
Author

elorest commented Nov 17, 2017

Global Cache should help a lot.

One issue for people using github is that every branch/fork/pr that's ever existed remains in github repo's and can't be deleted or modified.
! [remote rejected] refs/pull/1/merge -> refs/pull/1/merge (deny updating a hidden ref)

This means that when using clone --mirror the repo sized can't be reduced. Is there a reason that shards needs to use git clone --mirror instead of just git clone?

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Nov 17, 2017

Shards needs all refs, because versions are tags and we can install any refs (branch, tag, commit). But maybe my understanding is wrong and --bare would be enough?

@jhass
Copy link
Member

jhass commented Nov 17, 2017

Worst case it could just try to fetch the specified ref and see if that works, no?

@elorest
Copy link
Author

elorest commented Nov 17, 2017

Even with a simple clone should be sufficient to switch to tags. @jhass your suggestion sounds good to me.

@faustinoaq
Copy link
Contributor

@ysbaddaden Thanks you so much for local cache feature! 🎉

Now I can install amber project dependencies faster:

before: ../shards/bin/shards  3.31s user 1.06s system 5% cpu 1:27.02 total
after: ../shards/bin/shards  1.63s user 0.31s system 8% cpu 23.473 total

We put a lot effort trying to reduce amber repo size, but --mirror flag doesn't help:

git clone --mirror [email protected]:amberframework/amber.git
Cloning into bare repository 'amber.git'...
Enter passphrase for key '/home/main/.ssh/id_rsa': 
remote: Counting objects: 12328, done.
remote: Compressing objects: 100% (481/481), done.
remote: Total 12328 (delta 402), reused 774 (delta 375), pack-reused 11471
Receiving objects: 100% (12328/12328), 7.78 MiB | 555.00 KiB/s, done.
Resolving deltas: 100% (6904/6904), done.

I think we can use --bare instead, WDYT?

git clone --bare [email protected]:amberframework/amber.git  
Cloning into bare repository 'amber.git'...
Enter passphrase for key '/home/main/.ssh/id_rsa': 
Enter passphrase for key '/home/main/.ssh/id_rsa': 
remote: Counting objects: 6657, done.
remote: Compressing objects: 100% (482/482), done.
remote: Total 6657 (delta 401), reused 763 (delta 377), pack-reused 5797
Receiving objects: 100% (6657/6657), 970.82 KiB | 203.00 KiB/s, done.
Resolving deltas: 100% (3661/3661), done.

@eliasjpr
Copy link

eliasjpr commented Dec 7, 2017

Maybe this could be as simple as adding a flag like shards install --base and it will default to --mirror

@faustinoaq
Copy link
Contributor

@ysbaddaden Do you think we can add an optional --bare flag?, something like shards install --bare

@elorest
Copy link
Author

elorest commented Dec 7, 2017

If --bare works is there any reason to use mirror? git clone --mirror is a lot larger than git clone --bare

@ysbaddaden
Copy link
Contributor

Done in 3a64a76. Tests are passing in local. Please try, but make sure to drop .shards or ~/.cache/shards beforehand.

@faustinoaq
Copy link
Contributor

faustinoaq commented Dec 12, 2017

@ysbaddaden I tested master and works fine, I tried installing different shards versions and all is working ok.

Also shards cache is lightweight and faster now. By example, instaling ambrockets dependencies::

Before:

$ rm -rf ~/.cache/shards lib shard.lock
$ time shards install
...
3.65s user 1.03s system 6% cpu 1:07.74 total
$ du -h -d 0 ~/.cache/shards
18M     /home/main/.cache/shards
$ du -h -d 1 .
48K     ./config
320K    ./.git
4.0K    ./bin
7.0M    ./lib
8.0K    ./spec
60K     ./public
116K    ./src
7.8M    .

After: (master)

$ rm -rf ~/.cache/shards lib shard.lock
$ time shards install
...
2.60s user 0.81s system 11% cpu 29.149 total
$ du -h -d 0 ~/.cache/shards
11M     /home/main/.cache/shards
$ du -h -d 1 .
48K     ./config
320K    ./.git
4.0K    ./bin
7.0M    ./lib
8.0K    ./spec
60K     ./public
116K    ./src
7.8M    .

@vtambourine
Copy link

Another use case for ability to download package instead of cloning shard repo is to support fast containers, such as Alpine Linux containers.
In my case, I'm using alpine image as a base image for containerizing my Crystal app. It doesn't have git installed by default, and adding git package only for installing shards seems quite excessive. I would rather have an option to specify release packages I need to build in my container. Something like:

dependencies:
  dotenv:
    package: https://github.com/petoem/envy/archive/v0.1.0.tar.gz

@luislavena
Copy link
Contributor

@vtambourine I know the following comments does not provide a solution to your request, but you can take advantage of multiple FROM layers in Dockerfiles, allowing you to have the compiler, build your program and then only take the necessary binaries.

Something similar to what Brian posted here:

https://manas.tech/blog/2017/04/03/shipping-crystal-apps-in-a-small-docker-image.html

And the repository:

https://github.com/bcardiff/miniserver

Hope that helps.

Cheers.

@vtambourine
Copy link

Thank you, @luislavena, that is really helpful! However, I really think Apline builds should also be quite useful.

@ysbaddaden
Copy link
Contributor

Closing. Both --bare and a global cache have been implemented (in master).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants