Pull-based autoscaling support #1588

Closed
axel3rd opened this issue Jan 4, 2022 · 12 comments

@axel3rd
Contributor

axel3rd commented Jan 4, 2022

According to GitHub's recommended autoscaling solutions, this solution doesn't support the pull-based autoscaling feature.

I'm not sure I fully understand what this feature is, but it would be nice to support it 😁 (as the other major solution does).

(Not much history found from af32d.)

@npalm
Member

npalm commented Jan 5, 2022

It seems that for pull-based support a list of repositories is configured, and these are queried for pending jobs to decide whether to scale. We have essentially chosen to scale based on events. For our own deployment we must handle 1000+ repositories, so continuously querying 1000+ repos would hit rate limits. But we are of course open to exploring how pull-based scaling can be supported if needed.
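
To make the rate-limit concern concrete, here is a back-of-the-envelope sketch. It assumes one API request per repository per poll and a budget of 5,000 requests per hour (the standard REST API limit for an authenticated user; actual budgets depend on token type and plan). The 60-second poll interval and the function name are illustrative and not taken from either project.

```python
def polling_requests_per_hour(repo_count: int,
                              poll_interval_seconds: int,
                              requests_per_repo: int = 1) -> int:
    """Estimate API requests per hour when every repo is polled each interval."""
    polls_per_hour = 3600 // poll_interval_seconds
    return repo_count * polls_per_hour * requests_per_repo


# Assumed budget, for illustration only.
ASSUMED_HOURLY_BUDGET = 5000

needed = polling_requests_per_hour(repo_count=1000, poll_interval_seconds=60)
over_budget = needed > ASSUMED_HOURLY_BUDGET
print(f"requests/hour needed: {needed}, over budget: {over_budget}")
```

Under these assumptions, polling 1000 repositories once a minute needs twelve times the assumed budget, which is why an event-driven design avoids the problem.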

@toast-gear
Contributor

toast-gear commented Jan 5, 2022

Pull-based scaling is where the controller scales runners based on a metric every poll period (sync period). The two current metrics are:

  1. the queue depth of all workflow runs against a defined list of repositories (intensive API call metric)
  2. the number of busy runners (light API call metric)

It's useful for people that:

  • can't or don't want to use webhooks
  • have slow / low scaling requirements / want something simple
  • want a high degree of control over runner allocation server-side and are happy with the overhead of managing named repositories server-side
  • run a GHES environment where you can control your rate limit budget, including disabling rate limiting entirely
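
As a rough illustration of the two metrics above, the sketch below computes a desired runner count once per poll period. It is not ARC's implementation: the function names, thresholds, and scale factors are assumptions chosen for the example, and fetching the metric values from the GitHub API is left out.

```python
import math


def clamp(value: int, minimum: int, maximum: int) -> int:
    """Keep value within [minimum, maximum]."""
    return max(minimum, min(maximum, value))


def desired_from_queue_depth(queued_and_in_progress_runs: int,
                             min_runners: int, max_runners: int) -> int:
    """Metric 1: one runner per queued or in-progress workflow run."""
    return clamp(queued_and_in_progress_runs, min_runners, max_runners)


def desired_from_busy_fraction(current_runners: int, busy_runners: int,
                               min_runners: int, max_runners: int,
                               scale_up_threshold: float = 0.75,
                               scale_down_threshold: float = 0.25,
                               scale_up_factor: float = 2.0,
                               scale_down_factor: float = 0.5) -> int:
    """Metric 2: grow or shrink the pool based on the share of busy runners."""
    if current_runners == 0:
        return min_runners
    busy_fraction = busy_runners / current_runners
    if busy_fraction >= scale_up_threshold:
        desired = math.ceil(current_runners * scale_up_factor)
    elif busy_fraction <= scale_down_threshold:
        desired = math.floor(current_runners * scale_down_factor)
    else:
        desired = current_runners
    return clamp(desired, min_runners, max_runners)


# One evaluation per poll period; the inputs would come from API calls.
print(desired_from_queue_depth(7, min_runners=1, max_runners=5))
print(desired_from_busy_fraction(4, 4, min_runners=1, max_runners=10))
```

The first metric needs the workflow runs of every listed repository on each poll, which is what makes it API-intensive; the second only needs the runner list.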

@npalm
Member

npalm commented Jan 5, 2022

@toast-gear Thanks for the clarification! Last week I worked on a PR to add a so-called "simple pool" that checks at a fixed interval whether the desired number of idle runners is available and, if not, scales up by the number that is missing. See #1577

@npalm
Member

npalm commented Jan 6, 2022

In PR #1577 we add a simple way to do pull-based scaling. This change lets you define a pool based on a cron expression and a desired pool size. The cron expression triggers a lambda, which updates the pool to the required size. Configuration is provided as a list, so multiple combinations of cron expressions and pool sizes can be defined, for example to support a different pool size on weekdays and weekends.
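
A minimal sketch of the idea, not the code from #1577: each list entry pairs a schedule with a pool size, and when an entry's schedule fires, a handler tops the pool up to that size. The field names, the cron strings, and the `handle_trigger` function are hypothetical; the scheduler itself and the runner count lookup are assumed to exist elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSchedule:
    schedule_expression: str  # illustrative cron-style expression
    size: int                 # desired number of idle runners


# Hypothetical configuration: a larger pool on weekdays than on weekends.
POOL_CONFIG = [
    PoolSchedule("cron(0 8 ? * MON-FRI *)", size=10),
    PoolSchedule("cron(0 8 ? * SAT-SUN *)", size=2),
]


def handle_trigger(entry: PoolSchedule, idle_runners: int) -> int:
    """Run when the entry's schedule fires; return how many runners to start."""
    return max(0, entry.size - idle_runners)


# Simulate both schedules firing while 3 runners are idle.
for entry in POOL_CONFIG:
    print(entry.schedule_expression, handle_trigger(entry, idle_runners=3))
```

Note that the handler only tops up: if more runners are idle than the entry asks for, it starts none rather than removing any.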

@toast-gear
Contributor

toast-gear commented Jan 6, 2022

@npalm Sounds very interesting and similar-ish to some of the stuff in actions-runner-controller (ARC)! You can set up a schedule to override the min and / or max replica count on a per runner set basis. It's helpful if you just want a basic setup that scales up to a set amount during core business hours and scales down to a set amount outside of them. With ARC being k8s based it's also helpful for cost optimisation, as you may want to scale your runner node group/s down to 0 outside of core business hours to save £££.

The pull-based scaling stuff is more centred around continuously scaling up and down, driven by some environmental metric/s and reassessed every poll period.

@axel3rd
Contributor Author

axel3rd commented Jan 6, 2022

In PR #1577 we add a simple way for pull based scaling

So can we consider that the pull-based autoscaling feature is (or will be) supported?
If "yes", I will change PR github/docs#13742 to "yes (org-level runners)" and hold it until #1577 is merged.

@toast-gear
Contributor

toast-gear commented Jan 6, 2022

Compared to ARC it sounds more like schedules, but it's sort of close enough, right? It's just an informal term really, so @npalm's work sounds close enough that the feature could be considered ticked off once released? Up to @npalm really though.

@npalm
Member

npalm commented Jan 6, 2022

Compared to ARC it sounds more like schedules, but it's sort of close enough, right? It's just an informal term really, so @npalm's work sounds close enough that the feature could be considered ticked off once released? Up to @npalm really though.

@toast-gear you are right. The trigger is scheduled. The lambda checks the number of active runners before scaling. Do you have any other suggestions that fit our approach?

@axel3rd
Contributor Author

axel3rd commented Jan 6, 2022

(PR github/docs#13742 updated; it will be taken out of draft when #1577 is merged.)

@toast-gear
Contributor

toast-gear commented Jan 7, 2022

I guess knowing a bit of the history would help.

Originally actions-runner-controller only had a single scaling metric, the TotalNumberOfQueuedAndInProgressWorkflowRuns pull-based metric. After that the PercentageRunnersBusy pull-based metric was added, and then the webhook server was added, introducing support for webhooks. As a result we needed a way of differentiating between the two scaling options as they were fundamentally different: the former is built around a poll period, the latter around an event. "Pull-based scaling" was chosen for the former as the scaling is driven by environmental details discovered by the controller, and "webhook-based scaling" for the latter as the scaling is driven by an event provided by GitHub.

I'd say the key detail that defines both pull-based and webhook-based scaling is that the scaling is driven by some environmental input, and runners scale up and down (within the limits of the config) as that input is reassessed on each poll or event, e.g. queue depth, how busy runners are, or an event. For me, scheduled scaling is a different feature as it isn't really responding to an environmental metric: it has an arbitrary runner count defined by the schedule and will keep the count at that level regardless of the environment. In the case of ARC, you can even combine scheduled scaling with pull- or webhook-driven scaling, so it really is its own feature, in the ARC project at least.

So if we want to stay true to the pull-based scaling term (which is a fairly informal term, so it's not the end of the world) then the docs probably need a new column, Scheduled scaling, and with #1577 merged philips-labs/terraform-aws-github-runner can be said to support this feature. That said, I haven't spent the time to go into detail on your new feature, so if you feel it fulfills the concept of pull-based scaling well enough (either by the characteristics I've suggested or just in your own way) then feel free to make that call.
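
To illustrate why scheduled scaling can sit alongside metric-driven scaling rather than replace it, here is a small sketch. It is not ARC's code: the `ScheduledOverride` type, its fields, and both functions are hypothetical. The schedule only moves the lower bound, and the environmental metric still decides the count within the bounds.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ScheduledOverride:
    start: datetime
    end: datetime
    min_runners: int  # lower bound enforced while the override is active


def effective_min(now: datetime, default_min: int,
                  overrides: List[ScheduledOverride]) -> int:
    """Return the first active override's minimum, else the default."""
    for override in overrides:
        if override.start <= now < override.end:
            return override.min_runners
    return default_min


def desired_runners(metric_desired: int, now: datetime, default_min: int,
                    max_runners: int,
                    overrides: List[ScheduledOverride]) -> int:
    """Clamp the metric-driven count to the schedule-adjusted bounds."""
    lower = effective_min(now, default_min, overrides)
    return max(lower, min(max_runners, metric_desired))


business_hours = ScheduledOverride(
    start=datetime(2022, 1, 10, 8, 0),
    end=datetime(2022, 1, 10, 18, 0),
    min_runners=5,
)

# Same metric value, different result inside and outside the schedule.
print(desired_runners(2, datetime(2022, 1, 10, 9, 0), 0, 10, [business_hours]))
print(desired_runners(2, datetime(2022, 1, 10, 20, 0), 0, 10, [business_hours]))
```

With only the schedule and no metric, the count would simply stay at the scheduled value, which is the behaviour described for scheduled scaling above.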

@axel3rd
Contributor Author

axel3rd commented Jan 7, 2022

❤️ pernickety discussions 😁

So perhaps updating the feature name would be more accurate. Proposal:

| Features | actions-runner-controller | terraform-aws-github-runner |
| --- | --- | --- |
| How runners can be scaled | Webhook events, Scheduled, Pull-based | Webhook events, Scheduled (org-level runners only) |

@axel3rd
Contributor Author

axel3rd commented Jan 12, 2022

Can be closed with #1577.
The GitHub docs update follows in github/docs#13742.

axel3rd closed this as completed Jan 12, 2022