populate-owners: Attempt pulls over HTTP #1791

wking · 2018-10-03T06:10:17Z

Spun off from #1761.

As discussed in #1601, pulling all projects serially is slow, because even with shallow clone you're pulling all the files in the HEAD tree. As discussed in e1f993f (#1285), GitHub does not enable git-upload-archive, so we can't use the Git protocol to select specific files.

One way to speed things up would be to allow fetching subsets of projects, but that risks corrupting OWNERS_ALIASES. If the utility only fetched some repositories, it would have an incomplete picture of aliases. For example, say the repository contained ci-operator/jobs/a/b and ci-operator/jobs/c/d, both of which defined aliases. A full run would result in aliases for both a/b and c/d in OWNERS_ALIASES. A subsequent run with only a/b would update any a/b aliases, but all other aliases, including those which had previously been injected by c/d, would be removed. You could work around that by caching information about which repositories were involved in OWNERS_ALIASES, but it would be a bit complicated.
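
To make the partial-run problem concrete, here is a minimal sketch. It assumes a simplified model in which OWNERS_ALIASES is regenerated from scratch on every run and alias names follow an org-repo-alias pattern like the ones quoted later in this thread; `build_owners_aliases` and the sample data are hypothetical, not the utility's actual code.

```python
# Hypothetical model: each repository contributes a mapping of alias -> members,
# and the generated file is rebuilt from whatever repositories were fetched.
def build_owners_aliases(repos):
    """Regenerate the alias map using only the repositories fetched in this run."""
    aliases = {}
    for repo, repo_aliases in repos.items():
        prefix = repo.replace("/", "-")
        for name, members in repo_aliases.items():
            aliases[f"{prefix}-{name}"] = sorted(members)
    return aliases


# Illustrative upstream state (made-up members).
upstream = {
    "a/b": {"approvers": ["alice"]},
    "c/d": {"approvers": ["carol"]},
}

full = build_owners_aliases(upstream)
partial = build_owners_aliases({"a/b": upstream["a/b"]})

print(sorted(full))     # aliases from both a/b and c/d
print(sorted(partial))  # only a/b aliases; the c/d entries are gone
```

Writing `partial` back over the file is exactly the corruption described above: nothing in the run knows that the c/d aliases ever existed.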

This commit, on the other hand, takes advantage of the fact that most of our repositories are public. It optimistically attempts an efficient HTTP pull, and only falls back to the shallow clone if the efficient pull fails. That's slightly more work for private repositories, where the HTTP attempt is wasted effort, but it's much more efficient for public repos.
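
The control flow is roughly the following. This is a sketch under assumptions, not the code in this PR: `pull_owners`, `HTTPPullError`, and the two injected pull functions are hypothetical names, and the fakes stand in for real network calls.

```python
class HTTPPullError(Exception):
    """Raised when the cheap HTTP pull cannot be used (private repo, 403, ...)."""


def pull_owners(org, repo, http_pull, git_pull):
    """Try the efficient HTTP pull first; fall back to the shallow clone on failure.

    Returns the pulled data and the transport that produced it.
    """
    try:
        return http_pull(org, repo), "http"
    except HTTPPullError:
        # Wasted effort for private repositories, but only one quick request.
        return git_pull(org, repo), "git"


# Fake transports for illustration only.
def fake_http_pull(org, repo):
    if repo == "private-repo":
        raise HTTPPullError("403")
    return {"approvers": ["alice"]}


def fake_git_pull(org, repo):
    return {"approvers": ["bob"]}


print(pull_owners("openshift", "installer", fake_http_pull, fake_git_pull))
print(pull_owners("openshift", "private-repo", fake_http_pull, fake_git_pull))
```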

Another potential issue is consuming your API quota of 60 unauthenticated requests per hour. But if you hit that limit, we just fall back to Git, so it's not the end of the world. @stevekuznetsov brought up the NAT-ed office side of this here, and if waiting on SSH is still not sufficient for those users, I can authenticate our HTTP requests. Do folks have a preference among GitHub's auth methods?
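
For reference, one possible shape for an authenticated request, should we want it. This sketch assumes token-based auth via the `Authorization` header and the `.sha` media type from the commit endpoint docs; `commit_sha_request`, `should_fall_back`, and the `master` default are illustrative choices, and nothing here is sent over the network.

```python
import urllib.request


def commit_sha_request(org, repo, ref="master", token=None):
    """Build (but do not send) a request for the SHA-1 of a commit reference."""
    url = f"https://api.github.com/repos/{org}/{repo}/commits/{ref}"
    # Assumed media type: asks GitHub to return just the SHA instead of full JSON.
    headers = {"Accept": "application/vnd.github.v3.sha"}
    if token:
        # Authenticated requests get a higher quota than the 60/hour anonymous one.
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)


def should_fall_back(status):
    """Treat any non-200 response (rate limit, private repo, ...) as 'use Git'."""
    return status != 200


req = commit_sha_request("openshift", "release", token="example-token")
print(req.full_url)
print(should_fall_back(403))
```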

Another alternative would be to pull in parallel (as we used to before e1f993f), but that still has the bandwidth cost of shifting lots of files that we don't care about for this use case. You could save some bandwidth by caching blobs locally, but then you'd have the complication of cache management.

The "got owners ..." message is there because the HTTP pulls don't produce any terminal output. The Git pulls write status updates to stderr, but now that it might be a while between Git pulls, it's good to let the caller know something is happening ;).

Docs for the commit endpoint I'm hitting are at https://developer.github.com/v3/repos/commits/#get-the-sha-1-of-a-commit-reference.

I've also tacked on 5f851d4 to break a cyclic injection of our own OWNERS_ALIASES. Let me know if you want that spun off into a separate PR.

stevekuznetsov · 2018-10-03T07:52:52Z

I think this is a net benefit so it's likely fine as long as it does not impact the bottom line for people who are not going to be able to take advantage of the HTTP pulls anyway. The ability to target a specific org/repo is likely the more useful feature.

wking · 2018-10-04T04:09:58Z

> I think this is a net benefit so it's likely fine as long as it does not impact the bottom line for people who are not going to be able to take advantage of the HTTP pulls anyway.

The impact on those folks should be negligible: one quick 403 per repo, and for these folks each repo will be transferring the whole HEAD tree over SSH anyway. I haven't benchmarked it, though.

> The ability to target a specific org/repo is likely the more useful feature.

Worth the trouble of caching alias provenance? Adding auth to the HTTP pulls seems much more straightforward; I just don't know which auth flavor you want.
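
For a sense of what caching alias provenance would involve, here is a rough sketch. It assumes we persist a second map from alias name to the repository that injected it; `partial_update` and its data shapes are hypothetical and not part of this PR.

```python
def partial_update(existing, provenance, fetched):
    """Merge a partial fetch into the existing aliases without dropping others.

    existing:   alias name -> members (current OWNERS_ALIASES content)
    provenance: alias name -> repository that injected the alias
    fetched:    repository -> {alias name -> members} for this run only
    """
    # Keep every alias whose source repository was not part of this run.
    merged = {a: m for a, m in existing.items() if provenance.get(a) not in fetched}
    new_provenance = {a: r for a, r in provenance.items() if r not in fetched}
    # Replace the aliases of the repositories we did fetch.
    for repo, aliases in fetched.items():
        for alias, members in aliases.items():
            merged[alias] = sorted(members)
            new_provenance[alias] = repo
    return merged, new_provenance


existing = {"a-b-approvers": ["alice"], "c-d-approvers": ["carol"]}
provenance = {"a-b-approvers": "a/b", "c-d-approvers": "c/d"}
fetched = {"a/b": {"a-b-approvers": ["bob", "alice"]}}

merged, provenance = partial_update(existing, provenance, fetched)
print(merged)  # a/b aliases refreshed, c/d aliases preserved
```

The extra state to store, version, and keep consistent is the complication mentioned above.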

As discussed in #1601, pulling all projects serially is slow, because even with shallow clone you're pulling all the files in the HEAD tree. As discussed in e1f993f (populate-owners: Also slurp OWNERS_ALIASES, 2018-08-25, openshift#1285), GitHub does not enable git-upload-archive, so we can't use the Git protocol to select specific files.

One way to speed things up would be to allow fetching subsets of projects, but that risks corrupting OWNERS_ALIASES. If the utility only fetched some repositories, it would have an incomplete picture of aliases. For example, say the repository contained ci-operator/jobs/a/b and ci-operator/jobs/c/d, both of which defined aliases. A full run would result in aliases for both a/b and c/d in OWNERS_ALIASES. A subsequent run with only a/b would update any a/b aliases, but all other aliases, including those which had previously been injected by c/d, would be removed. You could work around that by caching information about which repositories were involved in OWNERS_ALIASES, but it would be a bit complicated.

This commit, on the other hand, takes advantage of the fact that most of our repositories are public. It optimistically attempts an efficient HTTP pull, and only falls back to the shallow clone if the efficient pull fails. That's slightly more work for private repositories, where the HTTP attempt is wasted effort, but it's much more efficient for public repos.

Another potential issue is consuming your API quota of 60 unauthenticated requests per hour [1]. But if you hit that limit, we just fall back to Git, so it's not the end of the world.

Another alternative would be to pull in parallel (as we used to before e1f993f), but that still has the bandwidth cost of shifting lots of files that we don't care about for this use case. You could save some bandwidth by caching blobs locally, but then you'd have the complication of cache management.

The "got owners ..." message is because the HTTP pulls don't produce any terminal output. The Git pulls write status updates to stderr, but now that it might be a while between Git pulls, it's good to let the caller know something is happening ;).

Docs for the commit endpoint I'm hitting are in [2].

[1]: https://developer.github.com/v3/#rate-limiting
[2]: https://developer.github.com/v3/repos/commits/#get-the-sha-1-of-a-commit-reference

Before this commit, the ci-operator/jobs/openshift/release directory caused an openshift/release entry to be injected in our org/repo slice. That led to our own aliases being looped back around and re-injected into our OWNERS_ALIASES with things like:

    openshift-installer-installer-approvers:
    - aaronlevy
    ...
    openshift-release-installer-approvers:
    - aaronlevy
    ...

With this commit we break that cycle.
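
One way to picture the fix is a filter on the org/repo list. This is a sketch only: it assumes the list is derived from ci-operator/jobs/<org>/<repo> directory paths, and `org_repos_from_paths` and `SELF` are hypothetical names rather than the ones used in the commit.

```python
# The repository that hosts the generated OWNERS_ALIASES (assumed to be skipped).
SELF = ("openshift", "release")


def org_repos_from_paths(paths, skip=SELF):
    """Collect (org, repo) pairs from job directories, excluding our own repository."""
    seen = []
    for path in paths:
        parts = path.split("/")
        # Only paths of the form ci-operator/jobs/<org>/<repo> are considered.
        if len(parts) != 4 or parts[:2] != ["ci-operator", "jobs"]:
            continue
        org_repo = (parts[2], parts[3])
        if org_repo == skip or org_repo in seen:
            continue  # skipping ourselves is what breaks the cycle
        seen.append(org_repo)
    return seen


paths = [
    "ci-operator/jobs/openshift/installer",
    "ci-operator/jobs/openshift/release",
]
print(org_repos_from_paths(paths))
```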

wking · 2018-10-04T04:59:27Z

Rebased onto master with cc625ba -> 6616fd2 to pick up #1792 and get an OWNERS file in there.

stevekuznetsov · 2018-10-04T06:33:52Z

You could add OAuth to it, but it seems more complex than it needs to be.

/lgtm

openshift-ci-robot · 2018-10-04T06:33:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [stevekuznetsov]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot requested review from abhgupta and abhinavdahiya October 3, 2018 06:10

openshift-ci-robot added the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) and sig/azure (Categorizes item related to Azure jobs) labels Oct 3, 2018

wking mentioned this pull request Oct 3, 2018

cluster-launch-installer-e2e: Start setting expirationDate again #1761

Merged

wking added 3 commits October 3, 2018 21:39

Auto-generate OWNERS with tools/populate-owners

6616fd2

wking force-pushed the pull-owners-over-http branch from cc625ba to 6616fd2 Compare October 4, 2018 04:58

openshift-ci-robot assigned stevekuznetsov Oct 4, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2018

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2018

openshift-merge-robot merged commit 31c535c into openshift:master Oct 4, 2018

wking deleted the pull-owners-over-http branch October 4, 2018 06:47

pgier mentioned this pull request Jan 12, 2019

populate-owners: resolve owner aliases #2572

Merged

wking mentioned this pull request Nov 25, 2019

Autoowners openshift/ci-tools#115

Merged
