Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Feature Autoscaling: Enable organization runner autoscaler #213

Closed
wants to merge 2 commits into from

Conversation

erikkn
Copy link
Contributor

@erikkn erikkn commented Nov 26, 2020

This PR introduces a workaround for autoscaling your organization runners. At the moment, Github doesn't provide an API endpoint that gives you a number of how full your (organizational) queue is. This PR tries to implement a workaround for this, by checking all the repos in an organization for which Actions is enabled and subsequently checks the latest changed repositories. It does the latter by calculating how many minutes ago a repo got changed and compares this number with the SyncPeriod time.
Subsequently, the controller will watch the queues of these repositories and scale up/down.

@erikkn
Copy link
Contributor Author

erikkn commented Nov 26, 2020

Fixes #158

@erikkn erikkn changed the title Feature Autoscaling: Enable organization runner autoscaler [WIP] Feature Autoscaling: Enable organization runner autoscaler Nov 27, 2020
@erikkn
Copy link
Contributor Author

erikkn commented Dec 3, 2020

Currently blocked on this: #221

@ghost
Copy link

ghost commented Dec 4, 2020

One other comment - some api's that you will be depending on in the latest go-github release might not be available for github enterprise users (e.g. get action enabled repositories exists for teams, pro versions of github but not for github enterprise) This will break deployments. We should use the strategy pattern here.

mumoshu and others added 2 commits December 7, 2020 20:27
It turned out previous versions of runner images were unable to run actions that require `AGENT_TOOLSDIRECTORY` or `libyaml` to exist in the runner environment. One of notable examples of such actions is [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby).

This change adds the support for those actions, by setting up AGENT_TOOLSDIRECTORY and installing libyaml-dev within runner images.
@erikkn
Copy link
Contributor Author

erikkn commented Dec 8, 2020

@ZacharyBenamram thank you again for your reviews! What do you think of the current change proposal? I haven't updated the tests yet, because I wanted to hear your opinion on the matter first.

return nil, fmt.Errorf("validating autoscaling metrics: one or more metrics is required")
} else if tpe := metrics[0].Type; tpe != v1alpha1.AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns {
return nil, fmt.Errorf("validting autoscaling metrics: unsupported metric type %q: only supported value is %s", tpe, v1alpha1.AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually add another metrics type here instead of overwriting the previous implementation. What do you think about what I've written in https://github.com/summerwind/actions-runner-controller/pull/223/files#diff-2fb5cc41bbf098ed839a501baea824efe104844213ce3969a63523439a132600R31

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@erikkn erikkn Dec 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work you did on your PR!

So when I look at the new calculateReplicasByPercentageRunnersBusy method you introduce, you are essentially fetching all the runners and subsequently iterate over that list by checking which one is busy. Next, you have a couple of conditionals that instructs you to either scale up or down.

I like it because this also allows us to get completely rid of checking queues on an individual repository level. What do you reckon to do next here? I haven't given your PR a very thorough review yet, but with yours merged I guess we can close this one?
I am not sure if it makes sense of cherry-picking your changes, refactor this PR accordingly and keeping the logic, including the ugly
if len(metrics[0].RepositoryNames) < 1 && r.GitHubClient.GithubEnterprise conditional, specifically just for the calculateReplicasByQueuedAndInProgressWorkflowRuns metric.

Copy link

@ghost ghost Dec 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your hpa scheme is also very much needed. I recommend you create another hpa scheme and add it into this if else block and pull out your changes into a separate func. Something like v1aplha1.AutoscalingMetricTypeNumQueuedAndInProgressForEnabledRepositories and then leave the original scheme as is.

I don't mind keeping these as separate PRs and if yours goes in first, merge your changes into mine afterwards. What are your thoughts?

@callum-tait-pbx
Copy link
Contributor

@erikkn any update on this?

@erikkn
Copy link
Contributor Author

erikkn commented Feb 1, 2021

@callum-tait-pbx , haven't really had the time to work on this. Are you waiting for this, or can you use PercentageRunnersBusy ?

@callum-tait-pbx
Copy link
Contributor

callum-tait-pbx commented Feb 2, 2021

Tbh I haven't had a chance to get around to testing the the PercentageRunnersBusy metric. I am expecting it to work for our needs as we just want a single RunnerDeployment defined to scale for the whole GitHub org without the need to maintain a specific list of repositories for that runner. We're going to be doing most of the legwork of configuring the runner environment via actions themselves instead of baking tools into a container etc so we don't have a need to segregate RunnerDeployments to specific repositories. I think the PercentageRunnersBusy metric will do exactly this but as I say, haven't had a chance to test it as I've been sorting out the stuff we need around this solution. Thought I'd bump this PR though just to see if an alternative was still in the pipeline just in case PercentageRunnersBusy doesn't work for us some reason.

@erikkn
Copy link
Contributor Author

erikkn commented Feb 2, 2021

Tbh I haven't had a chance to get around to testing the the PercentageRunnersBusy metric. I am expecting it to work for our needs as we just want a single RunnerDeployment defined to scale for the whole GitHub org without the need to maintain a specific list of repositories for that runner. We're going to be doing most of the legwork of configuring the runner environment via actions themselves instead of baking tools into a container etc so we don't have a need to segregate RunnerDeployments to specific repositories. I think the PercentageRunnersBusy metric will do exactly this but as I say, haven't had a chance to test it as I've been sorting out the stuff we need around this solution. Thought I'd bump this PR though just to see if an alternative was still in the pipeline just in case PercentageRunnersBusy doesn't work for us some reason.

Fair enough, but how will you deal with patching your image (and the binaries living in those images)? And how to make changes to your environment without changing XYZ number of repos?

I will try to work on this PR this week btw.

@callum-tait-pbx
Copy link
Contributor

callum-tait-pbx commented Feb 2, 2021

Fair enough, but how will you deal with patching your image (and the binaries living in those images)? And how to make changes to your environment without changing XYZ number of repos?

We intend to use actions themselves for environment setup rather than baking tools directly into the runner images. So to use a public action as an example, instead of having node baked into the runner image we will use https://github.com/actions/setup-node to configure the required node version as defined in the workflow rather than having it baked into the underlying container. Obviously there are some cons doing it that way but we believe it will be a net positive approach for multiple reasons. We may end up having to bake some tooling into the runner image; it should be very limited beyond some internal tooling. We'd rather build out custom actions for things we can't find an appropriate open source action for than go down the other possible routes of lots of custom images or fewer but more complicated super custom images.

Actions can be version pinned so once we publish a new version of an action it’s a case of communicating and socialising internally. We’ll support X number of releases and then the actions will go into a deprecation process, basically a software house in a software house.

There are some good boilerplate templates to start from for creating actions e..g https://github.com/jacobtomlinson/python-container-action https://github.com/jacobtomlinson/go-container-action

I also imagine in reality this strategy will work for most of our workflows however there will be some legacy stuff that will get their own dedicated runners with more custom runner images. I’ve got a repository setup for creating custom images so we’ll be able to provide those easily enough as we need to produce a internal image regardless for our minimal image approach but it's built for a "library" of runners.

Seen we are going to patching very minimally, becuase of the lack of tools baked into the image, we can either have an outage to do it or spin up a second runnerdeployment with the new container image and migrate workflows over through labels in a sort of blue green fashion (having tested this on our tested on our test cluster first ofc :D)

I will try to work on this PR this week btw.

That would be awsome 🥇

@erikkn
Copy link
Contributor Author

erikkn commented Feb 25, 2021

Unfortunately, I started working on other projects so it seems like I am not gonna have time to work on this PR any time soon. I will close the PR now because it is open for 3 months now and nobody showed interest in picking it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants