feat: Workflow job based ephemeral runner scaling #721

Merged (12 commits) on Aug 11, 2021

Collaborator

@mumoshu mumoshu commented Aug 10, 2021

This adds support for two upcoming enhancements on the GitHub side of self-hosted runners: ephemeral runners and workflow_job events. You can't use these yet.

These features are not yet generally available to all GitHub users. Please take this pull request as preparation to make them available to actions-runner-controller users as soon as possible after GitHub releases the necessary features on their end.

Ephemeral runners

The former, ephemeral runners, is basically the reliable alternative to --once, which we've been using whenever you enable ephemeral: true (the default in actions-runner-controller).

--once has been suffering from a race issue #466. --ephemeral fixes that.

To enable ephemeral runners with actions/runner, you give --ephemeral to config.sh. This updated version of actions-runner-controller does it for you, by using --ephemeral instead of --once when you set RUNNER_FEATURE_FLAG_EPHEMERAL=true.
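The flag selection described above can be sketched roughly as follows. This is only an illustration, not the controller's actual code: the helper name `runner_flags` is hypothetical, and placing `--once` on `run.sh` (rather than `config.sh`) is an assumption. Only the two flag names and the `RUNNER_FEATURE_FLAG_EPHEMERAL` variable come from the description.

```python
import os


def runner_flags(env=None):
    """Return (config_sh_args, run_sh_args) for an ephemeral runner.

    Hypothetical helper for illustration only. It assumes --once is
    given to run.sh, while --ephemeral is given to config.sh.
    """
    env = os.environ if env is None else env
    if env.get("RUNNER_FEATURE_FLAG_EPHEMERAL", "").lower() == "true":
        # New behaviour: register the runner as ephemeral at config time.
        return ["--ephemeral"], []
    # Previous behaviour: run a single job, then exit.
    return [], ["--once"]


if __name__ == "__main__":
    print(runner_flags({"RUNNER_FEATURE_FLAG_EPHEMERAL": "true"}))
    print(runner_flags({}))
```

With the feature flag unset, the sketch falls back to `--once`, which mirrors the statement that the new flag is opt-in until GitHub releases the feature.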

Please read the section Ephemeral Runners in the updated version of our README for more information.

Note that ephemeral runners are not released on GitHub yet, and RUNNER_FEATURE_FLAG_EPHEMERAL=true won't work at all until the feature gets released on GitHub. Stay tuned for an announcement from GitHub!

workflow_job events

workflow_job is a new webhook event that corresponds to each GitHub Actions workflow job run. It gives actions-runner-controller a solid foundation for improving our webhook-based autoscaling.

Formerly, we've been exploiting webhook events like check_run for autoscaling. However, as none of our supported events included labels, you had to configure an HRA to match only the relevant check_run events. It wasn't trivial.

In contrast, a workflow_job event payload contains the labels of the runners requested. actions-runner-controller is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by the labels included in the webhook payload. So all you need to do to use webhook-based autoscaling is enable workflow_job on GitHub and expose actions-runner-controller's webhook server to the internet.
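As a rough illustration of the label filtering described above, here is a minimal sketch. The function `match_targets`, the dictionary of deployments, and the choice to ignore the `self-hosted` label are assumptions made for the example, not the controller's real implementation. The `workflow_job.labels` field is part of the webhook payload.

```python
def match_targets(payload, deployments):
    """Return names of runner deployments that satisfy the job's labels.

    `deployments` maps a (hypothetical) RunnerDeployment name to its labels.
    A deployment matches when it carries every label the job requested.
    """
    requested = set(payload["workflow_job"]["labels"])
    # Assumption: "self-hosted" is implied for every self-hosted runner,
    # so it is not required to appear in the deployment's own labels.
    requested.discard("self-hosted")
    return [
        name
        for name, labels in deployments.items()
        if requested <= set(labels)
    ]


if __name__ == "__main__":
    deployments = {
        "linux-small": ["linux", "small"],
        "linux-gpu": ["linux", "gpu"],
    }
    payload = {
        "action": "queued",
        "workflow_job": {"labels": ["self-hosted", "linux", "gpu"]},
    }
    print(match_targets(payload, deployments))  # ['linux-gpu']
```

Because the match is computed from the payload itself, no per-HRA event filter needs to be written by hand, which is the usability gain over check_run.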

Note that the current implementation of workflow_job support works in two directions, increment and decrement. An increment happens when the webhook server receives a workflow_job event with queued status. A decrement happens when it receives a workflow_job event with completed status. The latter makes scaling down faster, so that you waste less money than before. You still don't suffer from flapping, as a scale-down is still subject to scaleDownDelaySecondsAfterScaleOut.
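The increment and decrement behaviour can be sketched as a small counter. This is a deliberately simplified model, not how the controller is implemented: the class `ReplicaCounter` is hypothetical, and it assumes a scale-down is merely deferred until the delay since the last scale-out has passed. The `queued` and `completed` values correspond to the statuses named above.

```python
class ReplicaCounter:
    """Toy model of workflow_job driven scaling with a scale-down delay."""

    def __init__(self, scale_down_delay_seconds, minimum=0):
        self.delay = scale_down_delay_seconds
        self.minimum = minimum
        self.replicas = minimum
        self.pending_scale_downs = 0
        self.last_scale_out = None

    def handle(self, action, now):
        """Process one workflow_job event received at time `now` (seconds)."""
        if action == "queued":
            # Increment: a job is waiting for a runner.
            self.replicas += 1
            self.last_scale_out = now
        elif action == "completed":
            # Decrement: remembered, applied once the delay has passed.
            self.pending_scale_downs += 1
        return self.desired(now)

    def desired(self, now):
        """Return the desired replica count at time `now`."""
        delay_over = (
            self.last_scale_out is None
            or now - self.last_scale_out >= self.delay
        )
        if delay_over:
            self.replicas = max(
                self.minimum, self.replicas - self.pending_scale_downs
            )
            self.pending_scale_downs = 0
        return self.replicas


if __name__ == "__main__":
    counter = ReplicaCounter(scale_down_delay_seconds=60)
    print(counter.handle("queued", now=0))      # 1
    print(counter.handle("completed", now=10))  # 1 (still within the delay)
    print(counter.desired(now=60))              # 0
```

Deferring the decrement is what keeps a burst of queued and completed events from flapping the replica count in this model.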

To enable workflow_job webhook, go to your Webhook settings page on GitHub and check Workflow jobs:

[Screenshot: GitHub webhook settings page]

Please read the section Example 3: Scale on each workflow_job event in the updated version of our README for more information on its usage.

@sledigabel
Contributor

Hi @mumoshu, this looks fantastic! Thank you so much for this hard work!

Do I understand that:

@mumoshu
Collaborator Author

mumoshu commented Aug 10, 2021

"This is related to issue Alpha program for GitHub Action Runner Autoscaling API #697?"

Yes

"This will significantly reduce the number of API calls from the controller since the job status will be pushed rather than queried? API call throttling has been one of my main worries as we ramp up our setup."

Mostly yes. To be clear, actions-runner-controller already has webhook-based autoscaling, and you can use it today to avoid API rate limit issues.

workflow_job support just enhances the existing webhook-based autoscaling, so that it is easier to use.

@sledigabel
Contributor

Yes, this is great. I already use the webhook autoscaler, it's great for scaling fast but it still produces a lot of calls to check whether the jobs are active or idle etc.
It looks like this change will notify through the webhook that a job is finished instead of having to call <3

@mumoshu
Collaborator Author

mumoshu commented Aug 10, 2021

@sledigabel Just to be extra sure: does your HRA spec contain either the PercentageRunnersBusy or the TotalNumberOfQueuedAndInProgressWorkflowRuns metric?

"a lot of calls to check whether the jobs are active or idle etc."

Perhaps you meant "runners are active or idle"? Then it's very likely you have PercentageRunnersBusy configured.

In that case, you can already omit it in favor of webhook-based autoscaling, as long as your scaleUpTriggers configuration covers everything.

@sledigabel
Contributor

@mumoshu It runs both at the moment, but I think all the API calls are mostly coming from the checks:
https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/runner_controller.go#L353
It doesn't look like those calls will go away with this PR, but its benefits are still amazing. Being able to downscale quicker is awesome.

BTW, on the number of calls, I think there would be an opportunity to "save" some. I can create a separate issue for that.

Thanks again for this!

@sledigabel
Contributor

Also, I love that ephemeral runners are now an upstream feature! :-)

@mumoshu mumoshu merged commit fabead8 into master Aug 11, 2021
@mumoshu mumoshu deleted the workflow-job-based-ephemeral-runner-scaling branch August 11, 2021 00:52
@mumoshu mumoshu mentioned this pull request Aug 15, 2021
2 tasks
mumoshu added a commit that referenced this pull request Aug 16, 2021
mumoshu added a commit that referenced this pull request Aug 17, 2021
mumoshu added a commit that referenced this pull request Aug 25, 2021