-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Feature Autoscaling: Enable organization runner autoscaler #213
Conversation
Fixes #158 |
Currently blocked on this: #221 |
One other comment - some api's that you will be depending on in the latest go-github release might not be available for github enterprise users (e.g. get action enabled repositories exists for teams, pro versions of github but not for github enterprise) This will break deployments. We should use the strategy pattern here. |
f07ac29
to
30725bc
Compare
It turned out previous versions of runner images were unable to run actions that require `AGENT_TOOLSDIRECTORY` or `libyaml` to exist in the runner environment. One of notable examples of such actions is [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby). This change adds the support for those actions, by setting up AGENT_TOOLSDIRECTORY and installing libyaml-dev within runner images.
30725bc
to
2534945
Compare
@ZacharyBenamram thank you again for your reviews! What do you think of the current change proposal? I haven't updated the |
return nil, fmt.Errorf("validating autoscaling metrics: one or more metrics is required") | ||
} else if tpe := metrics[0].Type; tpe != v1alpha1.AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns { | ||
return nil, fmt.Errorf("validting autoscaling metrics: unsupported metric type %q: only supported value is %s", tpe, v1alpha1.AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually add another metrics type here instead of overwriting the previous implementation. What do you think about what I've written in https://github.com/summerwind/actions-runner-controller/pull/223/files#diff-2fb5cc41bbf098ed839a501baea824efe104844213ce3969a63523439a132600R31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That way https://github.com/summerwind/actions-runner-controller/pull/213/files#diff-2fb5cc41bbf098ed839a501baea824efe104844213ce3969a63523439a132600R41 wont break deployments on github enterprise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work you did on your PR!
So when I look at the new calculateReplicasByPercentageRunnersBusy
method you introduce, you are essentially fetching all the runners and subsequently iterate over that list by checking which one is busy. Next, you have a couple of conditionals that instructs you to either scale up or down.
I like it because this also allows us to get completely rid of checking queues on an individual repository level. What do you reckon to do next here? I haven't given your PR a very thorough review yet, but with yours merged I guess we can close this one?
I am not sure if it makes sense of cherry-picking your changes, refactor this PR accordingly and keeping the logic, including the ugly
if len(metrics[0].RepositoryNames) < 1 && r.GitHubClient.GithubEnterprise
conditional, specifically just for the calculateReplicasByQueuedAndInProgressWorkflowRuns
metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your hpa scheme is also very much needed. I recommend you create another hpa scheme and add it into this if else block and pull out your changes into a separate func. Something like v1aplha1.AutoscalingMetricTypeNumQueuedAndInProgressForEnabledRepositories
and then leave the original scheme as is.
I don't mind keeping these as separate PRs and if yours goes in first, merge your changes into mine afterwards. What are your thoughts?
@erikkn any update on this? |
@callum-tait-pbx , haven't really had the time to work on this. Are you waiting for this, or can you use |
Tbh I haven't had a chance to get around to testing the the |
Fair enough, but how will you deal with patching your image (and the binaries living in those images)? And how to make changes to your environment without changing XYZ number of repos? I will try to work on this PR this week btw. |
We intend to use actions themselves for environment setup rather than baking tools directly into the runner images. So to use a public action as an example, instead of having node baked into the runner image we will use https://github.com/actions/setup-node to configure the required node version as defined in the workflow rather than having it baked into the underlying container. Obviously there are some cons doing it that way but we believe it will be a net positive approach for multiple reasons. We may end up having to bake some tooling into the runner image; it should be very limited beyond some internal tooling. We'd rather build out custom actions for things we can't find an appropriate open source action for than go down the other possible routes of lots of custom images or fewer but more complicated super custom images. Actions can be version pinned so once we publish a new version of an action it’s a case of communicating and socialising internally. We’ll support X number of releases and then the actions will go into a deprecation process, basically a software house in a software house. There are some good boilerplate templates to start from for creating actions e..g https://github.com/jacobtomlinson/python-container-action https://github.com/jacobtomlinson/go-container-action I also imagine in reality this strategy will work for most of our workflows however there will be some legacy stuff that will get their own dedicated runners with more custom runner images. I’ve got a repository setup for creating custom images so we’ll be able to provide those easily enough as we need to produce a internal image regardless for our minimal image approach but it's built for a "library" of runners. Seen we are going to patching very minimally, becuase of the lack of tools baked into the image, we can either have an outage to do it or spin up a second runnerdeployment with the new container image and migrate workflows over through labels in a sort of blue green fashion (having tested this on our tested on our test cluster first ofc :D)
That would be awsome 🥇 |
Unfortunately, I started working on other projects so it seems like I am not gonna have time to work on this PR any time soon. I will close the PR now because it is open for 3 months now and nobody showed interest in picking it up. |
This PR introduces a workaround for autoscaling your organization runners. At the moment, Github doesn't provide an API endpoint that gives you a number of how full your (organizational) queue is. This PR tries to implement a workaround for this, by checking all the repos in an organization for which Actions is enabled and subsequently checks the latest changed repositories. It does the latter by calculating how many minutes ago a repo got changed and compares this number with the
SyncPeriod
time.Subsequently, the controller will watch the queues of these repositories and scale up/down.