Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

populate-owners: Attempt pulls over HTTP #1791

Merged
merged 3 commits into from
Oct 4, 2018

Conversation

wking
Copy link
Member

@wking wking commented Oct 3, 2018

Spun off from #1761.

As discussed in #1601, pulling all projects serially is slow, because even with shallow clone you're pulling all the files in the HEAD tree. As discussed in e1f993f (#1285), GitHub does not enable git-upload-archive, so we can't use the Git protocol to select specific files.

One way to speed things up would be to allow fetching subsets of projects, but that risks corrupting OWNERS_ALIASES. If the utility only fetched some repositories, it would have an incomplete picture of aliases. For example, say the repository contained ci-operator/jobs/a/b and ci-operator/jobs/c/d, both of which defined aliases. A full run would result in aliases for both a/b and c/d in OWNERS_ALIASES. A subsequent run with only a/b would update any a/b aliases, but all other aliases, including those which had previously been injected by c/d, would be removed. You could work around that by caching information about which repositories were involved in OWNERS_ALIASES, but it would be a bit complicated.

This commit, on the other hand, takes advantage of the fact that most of our repositories are public. It optimistically attempts an efficient HTTP pull, and only falls back to the shallow clone if the efficient pull fails. That's slightly more work for private repositories, where the HTTP attempt is wasted effort, but it's much more efficient for public repos.

Another potential issue is consuming your API quota of 60 unauthenticated requests per hour. But if you hit that limit, we just fall back to Git, so it's not the end of the world. @stevekuznetsov brought up the NAT-ed office side of this here, and if waiting on SSH is still not sufficient for those users, I can authenticate our HTTP requests. Do folks have a preference among GitHub's auth methods?

Another alterantive would be to pull in parallel (as we used to before e1f993f), but that still has the bandwidth cost of shifting lots of files that we don't care about for this use case. You could save some bandwidth by caching blobs locally, but then you'd have the complication of cache management.

The got owners ... message is because the HTTP pulls don't produce any terminal output. The Git pulls write status updates to stderr, but now that it might be a while between Git pulls, it's good to let the caller know something is happening ;).

Docs for the commit endpoint I'm hitting are here.

I've also tacked on 5f851d4 to break a cyclic injection of our own OWNERS_ALIASES. Let me know if you want that spun off into a separate PR.

@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/azure Categorizes item related to Azure jobs labels Oct 3, 2018
@stevekuznetsov
Copy link
Contributor

I think this is a net benefit so it's likely fine as long as it does not impact the bottom line for people who are not going to be able to take advantage of the HTTP pulls anyway. The ability to target a specific org/repo is likely the more useful feature.

@wking
Copy link
Member Author

wking commented Oct 4, 2018

I think this is a net benefit so it's likely fine as long as it does not impact the bottom line for people who are not going to be able to take advantage of the HTTP pulls anyway.

The impact on those folks should be negligabe. One quick 403 per repo, and for these folks each each repo will be transfering the whole HEAD tree over SSH. I haven't benchmarked it though.

The ability to target a specific org/repo is likely the more useful feature.

Worth the trouble of caching alias provenance? Adding auth to the HTTP pulls seems much more srraightforward, I just don't know which auth flavor you want.

wking added 3 commits October 3, 2018 21:39
As discussed in #1601, pulling all projects serially is slow, because
even with shallow clone you're pulling all the files in the HEAD tree.
As discussed in e1f993f (populate-owners: Also slurp OWNERS_ALIASES,
2018-08-25, openshift#1285), GitHub does not enable git-upload-archive, so
we can't use the Git protocol to select specific files.

One way to speed things up would be to allow fetching subsets of
projects, but that risks corrupting OWNERS_ALIASES.  If the utility
only fetched some repositories, it would have an incomplete picture of
aliases.  For example, say the repository contained
ci-operator/jobs/a/b and ci-operator/jobs/c/d, both of which defined
aliases.  A full run would result in aliases for both a/b and c/d in
OWNERS_ALIASES.  A subsequent run with only a/b would update any a/b
aliases, but all other aliases, including those which had previously
been injected by c/d, would be removed.  You could work around that by
caching information about which repositories were involved in
OWNERS_ALIASES, but it would be a bit complicated.

This commit, on the other hand, takes advantage of the fact that most
of our repositories are public.  It optimistically attempts an
efficient HTTP pull, and only falls back to the shallow clone if the
efficient pull fails.  That's slightly more work for private
repositories, where the HTTP attempt is wasted effort, but it's much
more efficient for public repos.

Another potential issue is consuming your API quota of 60
unauthenticated requests per hour [1].  But if you hit that limit, we
just fall back to Git, so it's not the end of the world.

Another alterantive would be to pull in parallel (as we used to before
e1f993f), but that still has the bandwidth cost of shifting lots of
files that we don't care about for this use case.  You could save some
bandwidth by caching blobs locally, but then you'd have the
complication of cache management.

The "got owners ..." message is because the HTTP pulls don't produce
any terminal output.  The Git pulls write status updates to stderr,
but now that it might be a while between Git pulls, it's good to let
the caller know something is happening ;).

Docs for the commit endpoint I'm hitting are in [2].

[1]: https://developer.github.com/v3/#rate-limiting
[2]: https://developer.github.com/v3/repos/commits/#get-the-sha-1-of-a-commit-reference
Before this commit, the ci-operator/jobs/openshift/release directory
caused an openshift/release entry to be injected in our org/repo
slice.  That lead to our own aliases being looped back around and
re-injected into our OWNERS_ALIASES with things like:

  openshift-installer-installer-approvers:
    - aaronlevy
    ...
  openshift-release-installer-approvers:
    - aaronlevy
    ...

With this commit we break that cycle.
@wking wking force-pushed the pull-owners-over-http branch from cc625ba to 6616fd2 Compare October 4, 2018 04:58
@wking
Copy link
Member Author

wking commented Oct 4, 2018

Rebased onto master with cc625ba -> 6616fd2 to pick up #1792 and get an OWNERS file in there.

@stevekuznetsov
Copy link
Contributor

You could add OAuth to it, but it seems more complex than it needs to be.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2018
@openshift-merge-robot openshift-merge-robot merged commit 31c535c into openshift:master Oct 4, 2018
@wking wking deleted the pull-owners-over-http branch October 4, 2018 06:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. sig/azure Categorizes item related to Azure jobs size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants