[WIP] Feature Autoscaling: Enable organization runner autoscaler #213

erikkn · 2020-11-26T18:34:54Z

This PR introduces a workaround for autoscaling organization runners. At the moment, GitHub doesn't provide an API endpoint that reports how full an organization's queue is. As a workaround, this PR lists all repositories in the organization that have Actions enabled and then selects the recently changed ones: it calculates how many minutes ago each repository was last changed and compares that number with the SyncPeriod.
The controller then watches the queues of these repositories and scales up or down accordingly.

erikkn · 2020-11-26T18:36:00Z

Fixes #158

controllers/autoscaling.go

erikkn · 2020-12-03T08:16:16Z

Currently blocked on this: #221

ghost · 2020-12-04T16:58:53Z

One other comment: some APIs that you will be depending on in the latest go-github release might not be available for GitHub Enterprise users (e.g. the endpoint to get Actions-enabled repositories exists for the Team and Pro versions of GitHub but not for GitHub Enterprise). This will break deployments. We should use the strategy pattern here.
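
One way the strategy pattern suggested here could look is sketched below. All names (`repoLister`, `enabledReposLister`, `staticLister`, `pickLister`) are hypothetical and do not exist in the controller; the API call is injected as a function so that the sketch does not assume any particular go-github method.

```go
package main

import (
	"errors"
	"fmt"
)

// repoLister is a hypothetical strategy interface: each implementation
// decides how to find the repositories whose queues should be watched.
type repoLister interface {
	ListRepos(org string) ([]string, error)
}

// enabledReposLister stands in for a strategy that relies on an API call
// which may not be available on GitHub Enterprise.
type enabledReposLister struct {
	fetch func(org string) ([]string, error) // injected API call
}

func (l enabledReposLister) ListRepos(org string) ([]string, error) {
	return l.fetch(org)
}

// staticLister stands in for an Enterprise-safe strategy: it uses an
// explicit list of repository names taken from configuration.
type staticLister struct {
	names []string
}

func (l staticLister) ListRepos(string) ([]string, error) {
	if len(l.names) == 0 {
		return nil, errors.New("repository names are required for this strategy")
	}
	return l.names, nil
}

// pickLister selects a strategy: explicit names always win, and Enterprise
// deployments never fall back to the API-based strategy.
func pickLister(enterprise bool, names []string, fetch func(string) ([]string, error)) repoLister {
	if enterprise || len(names) > 0 {
		return staticLister{names: names}
	}
	return enabledReposLister{fetch: fetch}
}

func main() {
	fetch := func(org string) ([]string, error) { return []string{org + "/api"}, nil }
	for _, enterprise := range []bool{false, true} {
		repos, err := pickLister(enterprise, nil, fetch).ListRepos("acme")
		fmt.Println(enterprise, repos, err)
	}
}
```

With this shape, an Enterprise deployment without an explicit repository list gets a clear validation error instead of a failing API call at runtime.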

It turned out that previous versions of the runner images were unable to run actions that require `AGENT_TOOLSDIRECTORY` or `libyaml` to exist in the runner environment. One notable example of such an action is [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby). This change adds support for those actions by setting up `AGENT_TOOLSDIRECTORY` and installing `libyaml-dev` within the runner images.

erikkn · 2020-12-08T11:01:34Z

@ZacharyBenamram thank you again for your reviews! What do you think of the current change proposal? I haven't updated the tests yet, because I wanted to hear your opinion on the matter first.

ghost · 2020-12-08T21:24:26Z

controllers/autoscaling.go

+		return nil, fmt.Errorf("validating autoscaling metrics: one or more metrics is required")
+	} else if tpe := metrics[0].Type; tpe != v1alpha1.AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns {
+		return nil, fmt.Errorf("validating autoscaling metrics: unsupported metric type %q: only supported value is %s", tpe, v1alpha1.AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns)
+	}


I would actually add another metric type here instead of overwriting the previous implementation. What do you think about what I've written in https://github.com/summerwind/actions-runner-controller/pull/223/files#diff-2fb5cc41bbf098ed839a501baea824efe104844213ce3969a63523439a132600R31 ?

That way https://github.com/summerwind/actions-runner-controller/pull/213/files#diff-2fb5cc41bbf098ed839a501baea824efe104844213ce3969a63523439a132600R41 won't break deployments on GitHub Enterprise.

Awesome work you did on your PR!

So when I look at the new calculateReplicasByPercentageRunnersBusy method you introduce, you are essentially fetching all the runners and then iterating over that list to check which ones are busy. Next, you have a couple of conditionals that decide whether to scale up or down.
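
The approach described above (count the busy runners, then scale on thresholds) can be sketched like this. It is not the implementation from #223: the `runner` and `policy` types, the threshold semantics, and the scaling factors are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// runner is a hypothetical, trimmed-down view of a self-hosted runner.
type runner struct {
	Name string
	Busy bool
}

// policy holds illustrative thresholds and factors; none of these values
// come from the PR under discussion.
type policy struct {
	MinReplicas        int
	MaxReplicas        int
	ScaleUpThreshold   float64 // scale up at or above this busy fraction
	ScaleDownThreshold float64 // scale down below this busy fraction
	ScaleUpFactor      float64
	ScaleDownFactor    float64
}

// clamp restricts n to the inclusive range [lo, hi].
func clamp(n, lo, hi int) int {
	if n < lo {
		return lo
	}
	if n > hi {
		return hi
	}
	return n
}

// desiredReplicas computes the fraction of busy runners and scales the
// current replica count up or down when that fraction crosses a threshold.
func desiredReplicas(runners []runner, current int, p policy) int {
	if len(runners) == 0 {
		return clamp(current, p.MinReplicas, p.MaxReplicas)
	}
	busy := 0
	for _, r := range runners {
		if r.Busy {
			busy++
		}
	}
	fraction := float64(busy) / float64(len(runners))

	desired := current
	switch {
	case fraction >= p.ScaleUpThreshold:
		desired = int(math.Ceil(float64(current) * p.ScaleUpFactor))
	case fraction < p.ScaleDownThreshold:
		desired = int(math.Ceil(float64(current) * p.ScaleDownFactor))
	}
	return clamp(desired, p.MinReplicas, p.MaxReplicas)
}

func main() {
	p := policy{MinReplicas: 1, MaxReplicas: 10, ScaleUpThreshold: 0.75,
		ScaleDownThreshold: 0.3, ScaleUpFactor: 2, ScaleDownFactor: 0.5}
	runners := []runner{{"r1", true}, {"r2", true}, {"r3", true}, {"r4", false}}
	fmt.Println(desiredReplicas(runners, 4, p))
}
```

Because the calculation only needs the list of runners and their busy flags, it does not have to query workflow queues per repository at all.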

I like it because this also allows us to completely get rid of checking queues at the individual repository level. What do you reckon we should do next here? I haven't given your PR a very thorough review yet, but with yours merged I guess we can close this one?
I am not sure whether it makes sense to cherry-pick your changes, refactor this PR accordingly, and keep the logic, including the ugly `if len(metrics[0].RepositoryNames) < 1 && r.GitHubClient.GithubEnterprise` conditional, specifically just for the calculateReplicasByQueuedAndInProgressWorkflowRuns metric.

I think your HPA scheme is also very much needed. I recommend you create another HPA scheme, add it to this if/else block, and pull your changes out into a separate func. Something like v1alpha1.AutoscalingMetricTypeNumQueuedAndInProgressForEnabledRepositories, and then leave the original scheme as is.
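
A rough sketch of that suggestion: keep the existing metric type, add a second one, and dispatch each to its own function. The `metricType` constants, their string values, and the `calculator` type below are placeholders for illustration, not the controller's actual v1alpha1 definitions.

```go
package main

import "fmt"

// metricType is a local stand-in for the autoscaling metric type enum.
type metricType string

const (
	// Stand-in for the existing metric type; the string value is illustrative.
	metricQueuedAndInProgress metricType = "TotalNumberOfQueuedAndInProgressWorkflowRuns"
	// Hypothetical new metric type along the lines proposed in the comment.
	metricQueuedForEnabledRepos metricType = "NumQueuedAndInProgressForEnabledRepositories"
)

// calculator computes a desired replica count for one metric type.
type calculator func() (int, error)

// calculate validates the metrics list and dispatches to the calculator
// registered for the first metric, leaving the other schemes untouched.
func calculate(metrics []metricType, calcs map[metricType]calculator) (int, error) {
	if len(metrics) == 0 {
		return 0, fmt.Errorf("validating autoscaling metrics: one or more metrics is required")
	}
	calc, ok := calcs[metrics[0]]
	if !ok {
		return 0, fmt.Errorf("validating autoscaling metrics: unsupported metric type %q", metrics[0])
	}
	return calc()
}

func main() {
	calcs := map[metricType]calculator{
		metricQueuedAndInProgress:   func() (int, error) { return 3, nil },
		metricQueuedForEnabledRepos: func() (int, error) { return 5, nil },
	}
	n, err := calculate([]metricType{metricQueuedForEnabledRepos}, calcs)
	fmt.Println(n, err)
}
```

Keeping each scheme behind its own function means a deployment that never selects the new metric type never reaches the API calls it depends on.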

I don't mind keeping these as separate PRs and if yours goes in first, merge your changes into mine afterwards. What are your thoughts?

callum-tait-pbx · 2021-01-26T15:59:56Z

@erikkn any update on this?

erikkn · 2021-02-01T19:26:31Z

@callum-tait-pbx, I haven't really had the time to work on this. Are you waiting for this, or can you use PercentageRunnersBusy?

callum-tait-pbx · 2021-02-02T10:56:57Z

Tbh I haven't had a chance to get around to testing the PercentageRunnersBusy metric. I am expecting it to work for our needs, as we just want a single RunnerDeployment defined to scale for the whole GitHub org without the need to maintain a specific list of repositories for that runner. We're going to be doing most of the legwork of configuring the runner environment via actions themselves instead of baking tools into a container etc., so we don't have a need to segregate RunnerDeployments to specific repositories. I think the PercentageRunnersBusy metric will do exactly this but, as I say, I haven't had a chance to test it as I've been sorting out the stuff we need around this solution. Thought I'd bump this PR though, just to see if an alternative was still in the pipeline in case PercentageRunnersBusy doesn't work for us for some reason.

erikkn · 2021-02-02T11:53:26Z

> Tbh I haven't had a chance to get around to testing the PercentageRunnersBusy metric. I am expecting it to work for our needs, as we just want a single RunnerDeployment defined to scale for the whole GitHub org without the need to maintain a specific list of repositories for that runner. We're going to be doing most of the legwork of configuring the runner environment via actions themselves instead of baking tools into a container etc., so we don't have a need to segregate RunnerDeployments to specific repositories. I think the PercentageRunnersBusy metric will do exactly this but, as I say, I haven't had a chance to test it as I've been sorting out the stuff we need around this solution. Thought I'd bump this PR though, just to see if an alternative was still in the pipeline in case PercentageRunnersBusy doesn't work for us for some reason.

Fair enough, but how will you deal with patching your image (and the binaries living in those images)? And how to make changes to your environment without changing XYZ number of repos?

I will try to work on this PR this week btw.

callum-tait-pbx · 2021-02-02T21:33:00Z

> Fair enough, but how will you deal with patching your image (and the binaries living in those images)? And how to make changes to your environment without changing XYZ number of repos?

We intend to use actions themselves for environment setup rather than baking tools directly into the runner images. So to use a public action as an example, instead of having node baked into the runner image we will use https://github.com/actions/setup-node to configure the required node version as defined in the workflow rather than having it baked into the underlying container. Obviously there are some cons doing it that way but we believe it will be a net positive approach for multiple reasons. We may end up having to bake some tooling into the runner image; it should be very limited beyond some internal tooling. We'd rather build out custom actions for things we can't find an appropriate open source action for than go down the other possible routes of lots of custom images or fewer but more complicated super custom images.

Actions can be version pinned so once we publish a new version of an action it’s a case of communicating and socialising internally. We’ll support X number of releases and then the actions will go into a deprecation process, basically a software house in a software house.

There are some good boilerplate templates to start from for creating actions, e.g. https://github.com/jacobtomlinson/python-container-action and https://github.com/jacobtomlinson/go-container-action

I also imagine that in reality this strategy will work for most of our workflows; however, there will be some legacy stuff that will get its own dedicated runners with more custom runner images. I’ve got a repository set up for creating custom images, so we’ll be able to provide those easily enough: we need to produce an internal image regardless for our minimal image approach, but the repository is built for a "library" of runners.

Since we are going to be patching very minimally, because of the lack of tools baked into the image, we can either have an outage to do it or spin up a second RunnerDeployment with the new container image and migrate workflows over through labels in a sort of blue-green fashion (having tested this on our test cluster first ofc :D)

> I will try to work on this PR this week btw.

That would be awesome 🥇

erikkn · 2021-02-25T09:44:49Z

Unfortunately, I started working on other projects, so it seems like I am not going to have time to work on this PR any time soon. I will close the PR now because it has been open for 3 months and nobody has shown interest in picking it up.

ghost reviewed Nov 26, 2020

erikkn changed the title ~~Feature Autoscaling: Enable organization runner autoscaler~~ [WIP] Feature Autoscaling: Enable organization runner autoscaler Nov 27, 2020

ghost mentioned this pull request Dec 4, 2020

Horizontal Pod Autoscaler - Busy Runner Scheme #223

Merged

erikkn force-pushed the org-autoscaler branch from f07ac29 to 30725bc Compare December 7, 2020 19:25

mumoshu and others added 2 commits December 7, 2020 20:27

Use ListEnabledRepos & check within sync-period time

2534945

erikkn force-pushed the org-autoscaler branch from 30725bc to 2534945 Compare December 7, 2020 19:27

ghost reviewed Dec 8, 2020

erikkn closed this Feb 25, 2021

mumoshu mentioned this pull request Mar 8, 2021

Autoscaling for an organization runner #158

Closed
