-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
populate-owners: Attempt pulls over HTTP #1791
populate-owners: Attempt pulls over HTTP #1791
Conversation
I think this is a net benefit so it's likely fine as long as it does not impact the bottom line for people who are not going to be able to take advantage of the HTTP pulls anyway. The ability to target a specific org/repo is likely the more useful feature. |
The impact on those folks should be negligabe. One quick 403 per repo, and for these folks each each repo will be transfering the whole HEAD tree over SSH. I haven't benchmarked it though.
Worth the trouble of caching alias provenance? Adding auth to the HTTP pulls seems much more srraightforward, I just don't know which auth flavor you want. |
As discussed in #1601, pulling all projects serially is slow, because even with shallow clone you're pulling all the files in the HEAD tree. As discussed in e1f993f (populate-owners: Also slurp OWNERS_ALIASES, 2018-08-25, openshift#1285), GitHub does not enable git-upload-archive, so we can't use the Git protocol to select specific files. One way to speed things up would be to allow fetching subsets of projects, but that risks corrupting OWNERS_ALIASES. If the utility only fetched some repositories, it would have an incomplete picture of aliases. For example, say the repository contained ci-operator/jobs/a/b and ci-operator/jobs/c/d, both of which defined aliases. A full run would result in aliases for both a/b and c/d in OWNERS_ALIASES. A subsequent run with only a/b would update any a/b aliases, but all other aliases, including those which had previously been injected by c/d, would be removed. You could work around that by caching information about which repositories were involved in OWNERS_ALIASES, but it would be a bit complicated. This commit, on the other hand, takes advantage of the fact that most of our repositories are public. It optimistically attempts an efficient HTTP pull, and only falls back to the shallow clone if the efficient pull fails. That's slightly more work for private repositories, where the HTTP attempt is wasted effort, but it's much more efficient for public repos. Another potential issue is consuming your API quota of 60 unauthenticated requests per hour [1]. But if you hit that limit, we just fall back to Git, so it's not the end of the world. Another alterantive would be to pull in parallel (as we used to before e1f993f), but that still has the bandwidth cost of shifting lots of files that we don't care about for this use case. You could save some bandwidth by caching blobs locally, but then you'd have the complication of cache management. The "got owners ..." message is because the HTTP pulls don't produce any terminal output. The Git pulls write status updates to stderr, but now that it might be a while between Git pulls, it's good to let the caller know something is happening ;). Docs for the commit endpoint I'm hitting are in [2]. [1]: https://developer.github.com/v3/#rate-limiting [2]: https://developer.github.com/v3/repos/commits/#get-the-sha-1-of-a-commit-reference
Before this commit, the ci-operator/jobs/openshift/release directory caused an openshift/release entry to be injected in our org/repo slice. That lead to our own aliases being looped back around and re-injected into our OWNERS_ALIASES with things like: openshift-installer-installer-approvers: - aaronlevy ... openshift-release-installer-approvers: - aaronlevy ... With this commit we break that cycle.
cc625ba
to
6616fd2
Compare
Rebased onto master with cc625ba -> 6616fd2 to pick up #1792 and get an |
You could add OAuth to it, but it seems more complex than it needs to be. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: stevekuznetsov, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Spun off from #1761.
As discussed in #1601, pulling all projects serially is slow, because even with shallow clone you're pulling all the files in the HEAD tree. As discussed in e1f993f (#1285), GitHub does not enable git-upload-archive, so we can't use the Git protocol to select specific files.
One way to speed things up would be to allow fetching subsets of projects, but that risks corrupting
OWNERS_ALIASES
. If the utility only fetched some repositories, it would have an incomplete picture of aliases. For example, say the repository containedci-operator/jobs/a/b
andci-operator/jobs/c/d
, both of which defined aliases. A full run would result in aliases for botha/b
andc/d
inOWNERS_ALIASES
. A subsequent run with onlya/b
would update anya/b
aliases, but all other aliases, including those which had previously been injected byc/d
, would be removed. You could work around that by caching information about which repositories were involved inOWNERS_ALIASES
, but it would be a bit complicated.This commit, on the other hand, takes advantage of the fact that most of our repositories are public. It optimistically attempts an efficient HTTP pull, and only falls back to the shallow clone if the efficient pull fails. That's slightly more work for private repositories, where the HTTP attempt is wasted effort, but it's much more efficient for public repos.
Another potential issue is consuming your API quota of 60 unauthenticated requests per hour. But if you hit that limit, we just fall back to Git, so it's not the end of the world. @stevekuznetsov brought up the NAT-ed office side of this here, and if waiting on SSH is still not sufficient for those users, I can authenticate our HTTP requests. Do folks have a preference among GitHub's auth methods?
Another alterantive would be to pull in parallel (as we used to before e1f993f), but that still has the bandwidth cost of shifting lots of files that we don't care about for this use case. You could save some bandwidth by caching blobs locally, but then you'd have the complication of cache management.
The
got owners ...
message is because the HTTP pulls don't produce any terminal output. The Git pulls write status updates to stderr, but now that it might be a while between Git pulls, it's good to let the caller know something is happening ;).Docs for the commit endpoint I'm hitting are here.
I've also tacked on 5f851d4 to break a cyclic injection of our own
OWNERS_ALIASES
. Let me know if you want that spun off into a separate PR.