Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to vendor scripts in pathogen repos? #3

Closed
jameshadfield opened this issue Jul 11, 2023 · 11 comments
Closed

How to vendor scripts in pathogen repos? #3

jameshadfield opened this issue Jul 11, 2023 · 11 comments

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Jul 11, 2023

This issue was originally written as an overview of git subtree, but later repurposed into a discussion of how to vendor scripts, specifically choosing between git subtree and git subrepo. In the end, we settled on git subrepo noting that this is a small implementation detail that can vary by pathogen repo, and be changed in the future.

Original issue

This is a summary of how I used git subtree as part of #2. Note that different pipelines can choose different methods of vendoring scripts from this repo, but git subtree is particularly nice as it requires no knowledge of its existence from a user of the pipeline.

Helpful reading: Git Subtree basics

The script was added to this repo (nextstrain/ingest) from within nextstrain/hepatitisB using ingest as a subtree. Specifically, from the hepB repo:

# use a branch (in hepB)
git checkout -b 'vendored-scripts'

# add the ingest repo as a subtre, using the 'apply-geolocation-rules' branch
git remote add ingest-remote [email protected]:nextstrain/ingest.git
git subtree add --prefix ingest/vendored ingest-remote apply-geolocation-rules --squash
# Adds a merge commit with one parent the previous host repo HEAD commit,
#     and the other a squashed commit of the 'ingest' repo

# move the script to the subtree repo (ingest/vendored), modify the snakemake
# rules accordingly and commit changes (to the hepB repo)

# push the changes up to the subtree repo (ingest) on branch apply-geolocation-rules
git subtree push --prefix ingest/vendored ingest-remote apply-geolocation-rules
# The commit message was identical, but only the changes to ingest/vendored
# were part of the subtree commit (probably obvious!)

It was tested in monkeypox by pulling in (this branch of) the ingest repo as a subtree, and updating the transform rule accordingly.

Reflections

This approach is pretty straightforward but changing the branch of a subtree seems to pollute the git history a bit. An alternative approach would be to simply have a subtree of ingest at the main branch, push any changes to the subtree up to a branch of the subtree, merge that branch via GitHub (with code review etc), then pull down the changes once they're on the main branch (of the subtree repo).

Given a script with differences in multiple repos, the most straightforward may be to simply to create a to-be-vendored version of a script locally, copy it into each repo to test, and when you are satisfied create a PR in this ingest repo without using subtrees at all. Once it's in main, it is straightforward to pull it into each pathogen repo using git subtree pull ....

Comments / improvements welcome. At the very least this may give others a quick start guide!

@victorlin
Copy link
Member

What's the use case of changing the branch of a subtree? Always pulling from nextstrain/ingest@main seems like a straightforward approach, and mimics how things work with other shared scripts in augur curate.

@joverlee521
Copy link
Contributor

I briefly explored git-subrepo as a potential alternative to git subtree.

Some things that are nice compared to git subtree:

  • It's obvious that a directory is a subrepo because it contains an auto-generated .gitrepo file
  • The .gitrepo file contains info on the remote source that the current subrepo is using.
  • Switching branches in the subrepo is painless and creates a single commit with message that summarizes the switch

I tested this out by creating a subrepo for nextstrain/ingest in my personal testing repo.
These two commits include the initial clone of the subrepo and a switch to the apply-geolocation-rules branch.

@jameshadfield
Copy link
Member Author

What's the use case of changing the branch of a subtree? Always pulling from nextstrain/ingest@main seems like a straightforward approach, and mimics how things work with other shared scripts in augur curate.

One use case is when a script needs modifications and you want to push them, however I think my subsequently suggested way to do this outside of subtree (or subrepo) is better. Hopefully we don't have too many occurrences of these.

git-subrepo looks good too. I'll give it a try. It shares the very nice feature that a user of the pipeline (and oftentimes the developer) doesn't need to know anything about it!

@victorlin
Copy link
Member

I tried this myself for the same apply-geolocation-rules script. See the following PRs:

Notes:

  • I did not use git subtree push. A summary of my approach can be found at nextstrain/hbv@c8c94af.
  • I did not use --squash since it can be useful to see the commit messages from nextstrain/ingest in pathogen repos.

@victorlin
Copy link
Member

victorlin commented Jul 12, 2023

@victorlin
Copy link
Member

victorlin commented Jul 12, 2023

Thoughts on subtree vs. subrepo after the experiments above:

  • subrepo does not provide a way to preserve commit messages from the other repo - it sees that as a problem rather than feature. This makes it comparable to using subtree with --squash. Between those two, subrepo has cleaner commit history:
    • subrepo makes 1 squash commit based on the history of the pathogen repo
    • subtree --squash makes 1 squash commit based on external history + 1 merge commit to bring it back into the pathogen repo's history
  • subrepo requires extra installation for anyone who wants to set up usage of centralized scripts using subrepo clone or pull new changes using subrepo pull.
  • With a pull-only approach, there is flexibility: individual pathogen repo maintainers can choose between subtree or subrepo. Of course, we can provide suggestions here.
  • subrepo's README lists several benefits compared to the built-in git commands. It's not yet clear to me which of these could be downsides of sticking with subtree, but it seems most of them can be avoided by the pull-only approach.

@joverlee521
Copy link
Contributor

We are going with git subtree, based on internal team discussion on Slack.

Added vendoring section to the README in 87ca1ed.

@victorlin
Copy link
Member

From #3 (comment):

I did not use --squash since it can be useful to see the commit messages from nextstrain/ingest in pathogen repos.

I just realized that these commit messages aren't visible when looking at the log of a particular file, but they are in the blame. Examples:

Having the blame might be useful, at the cost of a more verbose git history. I think the only way to truly understand the value here is to use it in practice for a bit!

@joverlee521
Copy link
Contributor

Oh, just figured out it's possible to see the commit history of a subrepo (without all of the git history pollution):

git subrepo fetch ingest/vendored
git log refs/subrepo/ingest/vendored/fetch 

You can also see commit history of a single file if you provide the filepath from the ingest repo:

git log refs/subrepo/ingest/vendored/fetch -- notify-slack

joverlee521 added a commit that referenced this issue Jul 31, 2023
Revisiting our previous decisions in #3
after @victorlin did more testing with the monkeypox repo with
`git subtree` in nextstrain/mpox#162 and
`git subrepo` in nextstrain/mpox#164.
@victorlin victorlin changed the title Using git subtree to add / vendor scripts How to vendor scripts in pathogen repos? Aug 1, 2023
@victorlin
Copy link
Member

I'm looking back at the Slack conversation and it seems like we were on the fence about subrepo vs. subtree and I chose one mostly to have a decision to move forward.

Since we've adopted git subrepo, we've ran into some issues such as #21, and recently @tsibley made a "bad suggestion" (which I don't think is that bad) to use git subtree. I think it can be reconsidered. If other people think so too, I'll make PRs to switch to git subtree in the pathogen repos, and we can see how it looks.

Things to consider:

  • We've already spent some time with git subrepo, and it works well enough. git subtree might come with new hurdles that we haven't discovered yet.
  • It's meant to be a temporarily solution, though external usage is likely with Add tutorials for ingest pipelines docs.nextstrain.org#179.
  • In the end, the difference for external usage is just an extra installation step for git subrepo which is straightforward.

@victorlin
Copy link
Member

Closing with status quo of git subrepo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

3 participants