
Random Operation Cancelled Runner Decommissions #911

Closed
jbkc85 opened this issue Oct 26, 2021 · 83 comments

jbkc85 commented Oct 26, 2021

Describe the bug

In the current deployment of actions-runner-controller, whenever a GitHub runner has an action cancelled (via the GitHub UI) or has an action that fails (like a unit test), it seems to cause a cascading effect that triggers other random cancellations of actions running on other runners in the system. Furthermore, there is a high chance that, at the end of a GitHub workflow, one of the last remaining runners (i.e. a runner that survives until a scale-down is triggered via reconciling) is cancelled while 'busy' with a Workflow Job/Step.

Checks

  • we are not using ephemeral runners at this time - though ephemeral runners appear to have the same issue
  • we are on summerwind/actions-runner-controller:v0.20.1 and use the provided GitHub-runner image with our own custom OpenJDK installation
  • controller is installed via Helm:
syncPeriod: 5m
githubAPICacheDuration: 5m
nodeSelector:
  environment: ops
  purpose: infrastructure
  • AWS EKS w/ Cluster Autoscaling and Spot Instances (confirmed spot instances aren't causing a shutdown issue)

To Reproduce
Steps to reproduce the behavior (though it is more random than not):

  1. Set up actions-runner-controller with the above Helm deployment and the following spec:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: org-runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: <Github Org>
      nodeSelector:
        capacity-type: SPOT
        purpose: github-runners
        environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      image: <GitHub-runner image here>
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: org-runners
  scaleDownDelaySecondsAfterScaleOut: 3600
  minReplicas: 1
  maxReplicas: 150
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - <your repository here>
  2. Run a large workflow (32-64 jobs) and cancel one or two of the jobs.

Expected behavior
The expectation is that the jobs continue regardless of success/failure, and the workflow finishes the appointed jobs/tasks.

Screenshots
No screenshots, but the lines simply output Error: The Operation was canceled.

Environment (please complete the following information):

  • Controller Version: v0.20.1
  • Deployment Method: Helm
  • Helm Chart Version: v0.13.1

Additional context
The workflow is extremely active and can have anywhere between 32 and 64 running jobs. Each job takes between 6 and 20 minutes, depending on the test it is running. From what I have seen, the logs for 'unregistered' runners appear to have an impact once those runners are to be reconciled. I see a lot of:

2021-10-26T05:41:39.576Z	DEBUG	actions-runner-controller.runnerreplicaset	Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time.	{"runnerreplicaset": "default/org-runners-749hr", "runnerName": "org-runners-749hr-gr85x"}
2021-10-26T05:41:40.172Z	INFO	actions-runner-controller.runner	Skipped registration check because it's deferred until 2021-10-26 05:42:30 +0000 UTC. Retrying in 48.827500109s at latest	{"runner": "default/org-runners-749hr-w9ll4", "lastRegistrationCheckTime": "2021-10-26 05:41:30 +0000 UTC", "registrationCheckInterval": "1m0s"}

and based on the behavior, I wonder if what's happening is that the reconciler takes a registration timeout of a previous pod and applies it to these as they are trying to catch up - thus resulting in a runner being marked for deletion even though it is only a minute or two into its registration check interval.

With that being said, I was hoping to increase these intervals and add more of a grace period before the controller shuts down runners it doesn't 'know' to be busy or not, but honestly I can't tell if that's what is happening.

Any insight would be helpful - and let me know if there is anything else I can provide!

mumoshu commented Oct 29, 2021

@jbkc85 Hey! Have you already tried the new ephemeral runners that can be enabled via RUNNER_FEATURE_FLAG_EPHEMERAL as documented in https://github.com/actions-runner-controller/actions-runner-controller#ephemeral-runners?

Also- which flavor of GitHub are you using, GitHub Enterprise Server, or GitHub Cloud?

jbkc85 commented Oct 29, 2021

@jbkc85 Hey! Have you already tried the new ephemeral runners that can be enabled via RUNNER_FEATURE_FLAG_EPHEMERAL as documented in https://github.com/actions-runner-controller/actions-runner-controller#ephemeral-runners?

Also- which flavor of GitHub are you using, GitHub Enterprise Server, or GitHub Cloud?

Normal GitHub (that's Cloud? Definitely not Enterprise), mixed in with AWS EKS deployed runners.

I have tried ephemeral runners, with no luck. Generally what happens is that during a scale-up event, when the reconciler starts to run through its scale-down, it picks new pods that are not yet registered (because they haven't had a chance to register) and marks them for deletion. I might try ephemeral runners again to see what we can do, as maybe we just didn't catch the issue earlier. I will get back to you!

antodoms commented Nov 3, 2021

@jbkc85 We are also experiencing the same issue. Have you found a solution to this?

I am currently using a stateful RunnerSet, and the webhook terminates the wrong pod during a scale-down webhook event.

Was this working on versions before 0.20.1?

jbkc85 commented Nov 4, 2021

@jbkc85 We are also experiencing the same issue. Have you found a solution to this?

I am currently using a stateful RunnerSet, and the webhook terminates the wrong pod during a scale-down webhook event.

Was this working on versions before 0.20.1?

I am not sure it was ever 'working', depending on your workload. The fact is the reconciler reconciles nodes that have not yet 'registered', and when you are spinning up (and down) a ton of nodes, registration falls behind due to API limits or otherwise, so some of your runners eventually get reconciled prematurely. I am trying to pinpoint the exact cause of this in the code but I haven't had as much luck as I'd wished.

One way to eliminate SOME of these issues is to set your scaleDownDelaySecondsAfterScaleOut in the HorizontalRunnerAutoscaler higher (I set mine to 3600), but it's a temporary fix. If you are constantly churning through runners, eventually you don't scale out anymore, in which case the delay times out (as it's not reset per scale-up) and you run into the same issue.

mumoshu commented Nov 4, 2021

@jbkc85 Thanks for the info!
I was reviewing everything you've provided in this issue and I'm not sure what's happening yet.

the reconciler takes a registration timeout of a previous pod and applies it to these as they are trying to catch up - thus resulting in a runner being marked for deletion even though it is only a minute or two into its registration check interval.

I've read the code, but AFAIK this shouldn't happen, as we use the runner's CreationTimestamp for the registration check timeout, and it's reset by deleting/recreating the runner on scale-down/up.

https://github.com/actions-runner-controller/actions-runner-controller/blob/b6c33cee3276da76993c550c68fe6c9278b20464/controllers/runnerreplicaset_controller.go#L227

With that being said I was hoping to increase these intervals and have more of a grace period between the controller just shutting things down where it doesn't 'know' if the runner is busy or not

I think this one is hard-coded today:

https://github.com/actions-runner-controller/actions-runner-controller/blob/b6c33cee3276da76993c550c68fe6c9278b20464/controllers/runnerreplicaset_controller.go#L225

Would you mind changing it and building a custom actions-runner-controller image using our makefile, to see if making it longer works for you?

the reconciler starts to run through its scaling down, it picks new pods not yet registered (because they haven't had a chance to be registered) and marks them for deletion

I was thinking that we already guard against that case in

https://github.com/actions-runner-controller/actions-runner-controller/blob/b6c33cee3276da76993c550c68fe6c9278b20464/controllers/runnerreplicaset_controller.go#L196-L249

In a nutshell, we log Failed to check if runner is busy. Either this runner has never been successfully registered to GitHub or it still needs more time and delete the runner only when we are sure that your runner isn't available according to the GitHub API. If the runner isn't available, it shouldn't be running any workflow job. It needs to register itself to GitHub before running any workflow job.
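
To make that decision flow concrete, here is a simplified, self-contained Go sketch of the rule described above; it is an illustration only, not the actual ARC controller code, and the function and parameter names are hypothetical:

package main

import "fmt"

// Simplified illustration of the rule described above; this is not the
// actual ARC controller code and the names are hypothetical. A runner is
// only considered safe to delete when the GitHub API confirms it is not
// registered (so it cannot be running a job), or when its registration
// check has fully timed out.
func safeToDelete(registeredOnGitHub, busy, registrationTimedOut bool) bool {
	if registeredOnGitHub {
		// Registered runner: delete only if GitHub reports it as not busy.
		return !busy
	}
	// Never (or not yet) registered: wait out the registration check timeout.
	return registrationTimedOut
}

func main() {
	fmt.Println(safeToDelete(true, true, false))  // false: a busy runner is kept
	fmt.Println(safeToDelete(false, false, true)) // true: never registered and timed out
}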

I have tried ephemeral and to no luck

What was your runner image tag in image: <GitHub-runner image here>, and what did your RunnerDeployment look like when you tried ephemeral runners?

mumoshu commented Nov 4, 2021

@antodoms Hey! Thanks for reporting. I think yours is a completely different issue. Would you mind creating another issue for it, so that we can better differentiate it and avoid other folks bringing similar but different issue reports here, which would be confusing?

Anyway, I suspect you're being affected by a potential bug in the RunnerSet implementation. It's implemented separately from RunnerDeployment; RunnerDeployment has some logic to prevent busy runners from being terminated.

I might have missed implementing similar logic for RunnerSet.

mumoshu commented Nov 10, 2021

@antodoms Hey! Have you used RUNNER_FEATURE_FLAG_EPHEMERAL to enable the new runners?

I reviewed both actions-runner-controller's and actions/runner's (C#) code and have come to think that the use of ephemeral runners would solve your issue for RunnerSet.

mumoshu commented Nov 10, 2021

@jbkc85 Hey! Which runner image did you use, btw? I still think the use of new ephemeral runners would resolve your issue. So I was wondering if you used some outdated runner image that doesn't yet support RUNNER_FEATURE_FLAG_EPHEMERAL, or your RunnerDeployment spec was somehow wrong? 🤔

If you could share the exact RunnerDeployment manifest you used to test RUNNER_FEATURE_FLAG_EPHEMERAL, i'd greatly appreciate it! Thanks.

@mumoshu mumoshu added the question Further information is requested label Nov 10, 2021

antodoms commented Nov 10, 2021

@mumoshu, basically I want to share the Docker cache (i.e. images) between multiple GitHub runner pods, for which I thought RunnerSet was best. But now that I know RunnerSet uses a StatefulSet, I think it will create a new volume for each pod instead of sharing the volume between pods, so I am currently changing my runners to use RunnerDeployment with a PersistentVolumeClaim backed by EFS.

The only thing I am not able to do so far is reuse the existing images on the next pod; I'm not sure if I am targeting the volume correctly.

This is my current config:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "{{ .Values.runnerName }}"
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: "{{ .Values.runnerName }}"
  namespace: {{ .Release.Namespace }}
spec:
  replicas: {{ .Values.desiredReplica }}
  template:
    metadata:
      labels:
        app: "{{ .Values.runnerName }}"
        namespace: {{ .Release.Namespace }}
    spec:
      ephemeral: false
{{ if eq .Values.dockerWithinDocker true }} 
      image: {{ .Values.runnerImageDind }}
      dockerdWithinRunnerContainer: true
{{ else }}
      image: {{ .Values.runnerImage }}
{{ end }}
      repository: {{ .Values.repository }}
      labels:
        {{ toYaml .Values.labels | indent 4 }}
      volumes:
        - name: "{{ .Values.runnerName }}"
          emptyDir: {}
      volumeMounts:
        - name: "{{ .Values.runnerName }}"
          mountPath: /var/lib/docker
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: "{{ .Values.runnerName }}"
  namespace: {{ .Release.Namespace }}
spec:
  scaleDownDelaySecondsAfterScaleOut: {{ .Values.scaleDownDelaySecondsAfterScaleOut }}
  scaleTargetRef:
    name: "{{ .Values.runnerName }}"
{{ if eq .Values.statefulRunners true }}
    kind: "RunnerSet"
{{ end }}
  minReplicas: {{ .Values.minReplica }}
  maxReplicas: {{ .Values.maxReplica }}
  scaleUpTriggers:
  - githubEvent: {}
    duration: {{ .Values.coolDownDuration }}
    amount: {{ .Values.scaleCountPerWorkflow }}
---

mumoshu commented Nov 10, 2021

@antodoms Thanks! But, AFAIK, dockerd isn't designed to share /var/lib/docker across multiple dockerd processes.

Some adventurous folks have tried to bind-mount /var/lib/docker from the host into the Docker-in-Docker container. Sometimes they share /var/lib/docker with multiple containers.
The Docker daemon was explicitly designed to have exclusive access to /var/lib/docker. Nothing else should touch, poke, or tickle any of the Docker files hidden there.
https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/

So, all you could do would be to have a single persistent volume per pod and ensure RUNNER_FEATURE_FLAG_EPHEMERAL is set. That's basically RunnerSet + RUNNER_FEATURE_FLAG_EPHEMERAL.

jbkc85 commented Nov 10, 2021

@jbkc85 Hey! Which runner image did you use, btw? I still think the use of new ephemeral runners would resolve your issue. So I was wondering if you used some outdated runner image that doesn't yet support RUNNER_FEATURE_FLAG_EPHEMERAL, or your RunnerDeployment spec was somehow wrong? 🤔

If you could share the exact RunnerDeployment manifest you used to test RUNNER_FEATURE_FLAG_EPHEMERAL, i'd greatly appreciate it! Thanks.

@mumoshu I am so sorry, I've been out sick for the last... well, pretty much a week. I am going to try out a few of these things this week and see what happens. I do know that we used the ephemeral flag at one point, but I believe we saw the same issues. I can try it again this week when I get back into action and see what happens!

Thank you again for all the responses. I do apologize for leaving things high and dry w/ my absence.

jbkc85 commented Nov 11, 2021

@mumoshu now I remember. We had ephemeral on, but due to the constant registration check-ins we repeatedly ran ourselves out of GitHub API calls. We are talking about scaling from 1 to 130 nodes relatively quickly... so yeah, the ephemeral runners hurt us there.

I will still give this a shot whenever I get a chance. I do appreciate your communication and effort on this project!

mumoshu commented Nov 12, 2021

@jbkc85 Hey! Thanks for the response. Yeah, you mentioned that it might be due to the API rate limit. But looking at the runner's and the controller's implementation/code, I'd say the API rate limit shouldn't affect runner availability.

#911 (comment)

It would be great if you could provide us with more complete evidence of the API rate limit affecting runner availability, or easy steps to reproduce your issue in anyone's environment, if possible, so that we can leverage our collective resources to debug it further together!

Anyway, the API rate limit issue will mostly disappear once we manage to land #920. Thanks for suggesting that, too!

gmambro commented Nov 22, 2021

Hello, I am experiencing something quite similar to what @jbkc85 described.
Looking at the logs I can see a sequence of events like this one.

  • At 17:01:16.922 controller logs Created runner pod {"runner": "runners-ns/runner-5nzbv-nsns2...
  • 0.3 seconds later the controller logs Runner pod exists but we failed to check if runner is busy
  • Then it logs Rechecking the runner registration in 1m10.104479734s and
  • Skipped registration check because it's deferred until 2021-11-22 17:02:17 +0000 UTC. Retrying in 58.735154414s at latest
  • 17:01:17.863 the pod starts with --ephemeral
  • 17:01:20.489 pod/runner logs √ Runner successfully added
  • 17:01:21.413 pod/runner logs √ Runner connection is good
  • 17:01:21.503 pod/runner logs Starting Runner listener with startup type: service
  • 17:01:24:377 pod/docker logs "Daemon has completed initialization"
  • 17:01:27:495 pod/runner logs Running job: ...
  • the job goes without problems
  • 17:11:35.185 controller logs Already existing GitHub runner still appears offline . Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. and then Deleted runner pod
  • 17:11:35.895 pod/runner shutting down the listener

Please let me know if you have any suggestions of things we could try or other info you need to better understand it. Thank you!

mumoshu commented Nov 23, 2021

@gmambro Hey! Your log seems quite normal to me. You actually saw the Operation Cancelled error in the GitHub Actions UI and think that's your issue, right?

mumoshu commented Nov 23, 2021

@jbkc85 @gmambro Just curious, but did you miss setting fail-fast: false in your workflow definitions, by any chance? AFAIK, it's true by default, which means GitHub Actions cancels all the other jobs on a single job failure. (This behavior is totally unrelated to actions-runner-controller.)

gmambro commented Nov 23, 2021

The problem is that when the controller says existing GitHub runner still appears offline and deletes the runner pod, the pod is still running a job. So the runner started and ran a job, but according to the controller log it is as if it was never online.
Regarding fail-fast, I am ruling it out as the cause here; rather, the unexpected cancellations seem to be the ones that can eventually trigger other cancellations when the flag is set to true.

mumoshu commented Nov 23, 2021

@gmambro Hey! Thanks for confirming.

  • 17:01:27:495 pod/runner logs Running job: ...
  • the job goes without problems
  • 17:11:35.185 controller logs Already existing GitHub runner still appears offline . Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. and then Deleted runner pod
  • 17:11:35.895 pod/runner shutting down the listener

I reread this and all I can say is that this can happen when GitHub somehow failed to update the runner status.

Here's actions-runner-controller's runner-controller code that detects this condition:

https://github.com/actions-runner-controller/actions-runner-controller/blob/eefb48ba3f46f6cb2dfb1696ef44950506630b6f/controllers/runner_controller.go#L424-L434

https://github.com/actions-runner-controller/actions-runner-controller/blob/eefb48ba3f46f6cb2dfb1696ef44950506630b6f/controllers/runner_controller.go#L362-L363

So, this literally happens only when the runner is actually offline (as far as the controller can see from the response of the GitHub API call)

https://github.com/actions-runner-controller/actions-runner-controller/blob/eefb48ba3f46f6cb2dfb1696ef44950506630b6f/controllers/runner_controller.go#L353

https://github.com/actions-runner-controller/actions-runner-controller/blob/eefb48ba3f46f6cb2dfb1696ef44950506630b6f/github/github.go#L339-L340

Maybe this is caused by GitHub, rather than by actions-runner-controller?

mumoshu commented Nov 23, 2021

@gmambro Another thing I wanted to confirm is- Are you sure there's no suspicious event on the runner pod?

I guess this can also happen when it's running on an EC2 spot instance, a GCP preemptible VM, or the like, which forcefully terminates the runner pod when the VM gets automatically stopped.

gmambro commented Nov 23, 2021

Maybe this is caused by GitHub, rather than by actions-runner-controller?

This is what I was afraid of, so I wondered if there was anything we could check in the logs before calling it a GH issue. At first I thought of the rate limit, but after looking at this discussion I found that's not the case, as it would be logged in a different way.
I am not on EC2 spot instances, but I can double-check for other external events. Thank you!

mumoshu commented Nov 23, 2021

@gmambro Thanks! Yeah, as you've discovered, the controller should print a specific log message when it hits a GitHub API rate limit.

A dirty way to debug this further before filing an issue against GitHub would be this: you can try spamming curl against GitHub's 'list runners' and 'get runner' APIs so that you can see whether the runner status is really offline or not.
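
Equivalently, a small polling client can watch the same data. Below is a minimal Go sketch using go-github; the organization name, the GITHUB_TOKEN environment variable, and the go-github major version are assumptions you would adjust to your setup:

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/google/go-github/v39/github"
	"golang.org/x/oauth2"
)

func main() {
	// Repeatedly list the org's self-hosted runners and print each runner's
	// status and busy flag, so an "offline but busy" anomaly becomes visible.
	// Assumes a PAT in GITHUB_TOKEN and org-level runner registration.
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	for {
		runners, _, err := client.Actions.ListOrganizationRunners(ctx, "your-org", &github.ListOptions{PerPage: 100})
		if err != nil {
			fmt.Println("list runners:", err)
		} else {
			for _, r := range runners.Runners {
				// status=offline with busy=true is the inconsistent state discussed above.
				fmt.Printf("%s\tstatus=%s\tbusy=%v\n", r.GetName(), r.GetStatus(), r.GetBusy())
			}
		}
		time.Sleep(10 * time.Second)
	}
}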

gmambro commented Nov 23, 2021

Quick update. There is no external autoscaling thing that kills pods or nodes. Indeed the problem seems to be in APIs: the runner which gets killed is reported as offline. However, there is an interesting thing: the runner is offline but busy at the same time. This can probably help in detecting these anomalies (e.g. don't delete offline runners if they are busy)

jbkc85 commented Nov 23, 2021

@jbkc85 @gmambro Just curious, but did you miss setting fail-fast: false in your workflow definitions, by any chance? AFAIK, it's true by default, which means GitHub Actions cancels all the other jobs on a single job failure. (This behavior is totally unrelated to actions-runner-controller.)

We have used fail-fast, but in the situations where we see a random cancellation it's not this particular issue.

Quick update. There is no external autoscaling thing that kills pods or nodes. Indeed the problem seems to be in APIs: the runner which gets killed is reported as offline. However, there is an interesting thing: the runner is offline but busy at the same time. This can probably help in detecting these anomalies.

What @gmambro said. Every time I have looked into it, I have seen problems with the node being 'busy' but still being marked for deletion. I am trying to find more logs from the time it happened, but it's hard to catch in the act given its randomness.

gmambro commented Nov 23, 2021

@jbkc85 to me it looks like the GH API gives an inconsistent status for the runner which will be killed; I used a small client to monitor it and found that it was marked as offline but busy.

mumoshu commented Nov 24, 2021

the runner is offline but busy at the same time. This can probably help in detecting these anomalies (e.g. don't delete offline runners if they are busy)

@gmambro Wow, this is great news! If that's the only edge case I need to be aware of, it won't be that hard to implement a workaround on the actions-runner-controller side.

It can be as easy as updating this line to } else if offline && !busy { (or we can add another else branch with an additional log message that says the controller did detect this condition and delayed the runner termination)

https://github.com/actions-runner-controller/actions-runner-controller/blob/eefb48ba3f46f6cb2dfb1696ef44950506630b6f/controllers/runner_controller.go#L416
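
For illustration, the proposed guard boils down to something like the following standalone Go sketch; it is not a patch against the linked file and the names are made up:

package main

import "log"

// shouldDeleteOfflineRunner sketches the proposed condition: act on a runner
// only when it is offline AND not busy; an offline-but-busy runner is left
// alone and logged so the anomaly stays visible. Illustration only, not a
// patch against the linked file.
func shouldDeleteOfflineRunner(offline, busy bool) bool {
	if offline && busy {
		log.Println("runner appears offline but is still busy; deferring pod deletion")
		return false
	}
	return offline && !busy
}

func main() {
	log.Println(shouldDeleteOfflineRunner(true, true))  // false: deletion deferred
	log.Println(shouldDeleteOfflineRunner(true, false)) // true: safe to delete
}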

mumoshu commented Nov 24, 2021

@gmambro I'd appreciate it if you could share additional insight on this: I'm now wondering whether it ever comes back online or not.

The current actions-runner-controller design assumes that a runner's life is like:

  1. runner is created
  2. runner becomes offline, not busy
  3. runner becomes online, is still not busy
  4. runner is still online, becomes busy
  5. runner is still online, becomes not busy
  6. runner becomes offline, is still not busy
  7. runner disappears (from GitHub API and UI)

What does it look like for your problematic runner?

Is it like:

  1. runner is created
  2. runner becomes offline, not busy
  3. runner becomes online, not busy
  4. runner is still online, and becomes busy
  5. runner becomes offline, busy
    ...

or:

  1. runner is created
  2. runner becomes offline, not busy
  3. runner is still offline, becomes busy
    ...

Honestly speaking, I'm not fully sure how I'd reflect that information in the controller's design yet.
But I'd like to enhance the controller's logs at least, so that you can figure out where the issue is coming from when an anomaly happens.

gmambro commented Nov 24, 2021

At some point, they switch from online/busy to offline/busy and then a few minutes later the controller kills them. Let me run some more tests and I will come back.

gmambro commented Nov 24, 2021

I wasn't able to repro this issue today, so I still don't have as many experiments to share with you as I'd wish.
However, this is how it has looked so far:

  • runner is created
  • runner becomes offline, not busy
  • runner becomes online, not busy
  • runner is still online, and becomes busy
  • runner becomes offline, busy

I wonder if the runners can become temporarily offline/busy because of the job consuming all of their resources. If this behaviour is confirmed, it would be nice to have something like a grace period for offline/busy runners to give them a chance to go back online.

gmambro commented Nov 30, 2021

So, got some more repros, the story looks like this:

  • runner created
  • runner online, not busy
  • runner executes a job
  • runner online/busy
  • later, while running the job, the runner becomes offline/busy
  • one minute later the controller logs Already existing GitHub runner still appears offline . Recreating the pod to see if it resolves the issue and then kills the runner

tuomoa commented Feb 21, 2022

@tuomoa #1126 has been merged. I'd appreciate it if you could give it a try, by building a custom version of ARC and deploying it onto your environment 😃

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/CONTRIBUTING.md#testing-controller-built-from-a-pull-request

Okay great. I'll try to give it a go later this week. 👍

tuomoa commented Feb 22, 2022

@tuomoa #1126 has been merged. I'd appreciate it if you could give it a try, by building a custom version of ARC and deploying it onto your environment 😃
https://github.com/actions-runner-controller/actions-runner-controller/blob/master/CONTRIBUTING.md#testing-controller-built-from-a-pull-request

Okay great. I'll try to give it a go later this week. 👍

@mumoshu I gave it a go and it seems to work a little bit better, but something odd still happens. Out of 8 scaled-up runners, 2 failed in a similar manner.

1.6455157315321176e+09	ERROR	controller.runnerpod-controller	Reconciler error	{"reconciler group": "", "reconciler kind": "Pod", "name": "runnerset-7", "namespace": "default", "error": "runner is busy"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
...
1.6455163258324692e+09	INFO	actions-runner-controller.runnerpod	Removed runner from GitHub	{"runnerpod": "default/runnerset-7", "repository": "ourorganization/ourrepository", "organization": ""}
1.6455163258325107e+09	INFO	actions-runner-controller.runnerpod	Failed to delete pod within 1m0s. This is typically the case when a Kubernetes node became unreachable and the kube controller started evicting nodes. Forcefully deleting the pod to not get stuck.	{"runnerpod": "default/runnerset-7", "podDeletionTimestamp": "2022-02-22 07:37:34 +0000 UTC", "currentTime": 1645516325.8325045, "configuredDeletionTimeout": "1m0s"}

But the runner pod is still actually terminated (while it was running a job). I think it was terminated before the Removed runner from GitHub line was logged. EDIT: It seems like when scale-down events are received and the calculated replica count starts decreasing, some pods are terminated even though the controller just logs "runner is busy". Then after a while the controller says Removed runner from GitHub and Failed to delete pod within 1m0s...Forcefully deleting the pod to not get stuck. and continues to delete the pod. But it looks like the pod was actually stopped before this happens. So the pod is stopped but not deleted, and then it is deleted after a while, once it has become offline.

    state:
      terminated:
        containerID: docker://5f8e4d42f953e81495d3fe474af7fa5289183f05352d5b78581f2994ad051d1c
        exitCode: 143
        finishedAt: "2022-02-22T07:37:34Z"
        reason: Error
        startedAt: "2022-02-22T07:35:17Z"
  Normal   Started                 21m                kubelet                  Started container runner
  Normal   Killing                 2m24s              kubelet                  Stopping container runner

Also, I observed some other weird behaviour. For some reason the calculated desired replicas suddenly drop from 15 -> 1 while the pipeline is running and there are also queued jobs:

1.6455133051420949e+09	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 15	{"horizontalrunnerautoscaler": "default/runner-autoscaler", "suggested": 0, "reserved": 2, "min": 0, "cached": 0, "last_scale_up_time": "2022-02-22 06:52:56 +0000 UTC", "scale_down_delay_until": 1645513376, "max": 15}
...
1.6455133760791142e+09	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 1	{"horizontalrunnerautoscaler": "default/runner-autoscaler", "suggested": 0, "reserved": 1, "min": 0, "cached": 0, "max": 15}

I tried to track this down by looking at the webhook log, and it shows that when it receives a completed webhook for other jobs (which do not use the self-hosted label), it still scales down:

1.6455137469946017e+09	INFO	controllers.webhookbasedautoscaler	job scale up target is repository-wide runners	{"event": "workflow_job", "hookID": "334351049", "delivery": "...", "workflowJob.status": "completed", "workflowJob.labels": [], "repository.name": "ourrepository", "repository.owner.login": "ourrepoowner", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "repository": "ourrepository"}
1.6455137469946122e+09	INFO	controllers.webhookbasedautoscaler	Patching hra for capacityReservations update	{"before": null, "after": null}
1.6455137470045798e+09	INFO	controllers.webhookbasedautoscaler	scaled runner-autoscaler by -1

I'm not sure why this happens, but could it be that https://github.com/actions-runner-controller/actions-runner-controller/blob/0b9bef2c086a54b0731fa7aa0a6bd7cb856b48ee/controllers/horizontal_runner_autoscaler_webhook.go#L695 doesn't really consider the case where the labels in the event are empty (some jobs running in the cloud, and some jobs running self-hosted), and so it does not break out and instead returns the scale target?

EDIT: I added some debug logging to my custom-built image. I'm still not sure what happens, but I'm now wondering whether there is a problem when https://github.com/actions-runner-controller/actions-runner-controller/blob/7156ce040ec65dd80863f390f8397dbf01c8466d/controllers/runnerset_controller.go#L216 patches the new desired replicas (maybe because it is calculated wrong due to the issue above?) and it doesn't correctly consider that runners are busy, so it starts to terminate pods, which then get deleted later on. Or should the runner-pod finalizer make sure the pod is not terminated while it's busy, or is it currently preventing deletion instead of termination?

When you tell Kubernetes to delete an object that has finalizers specified for it, the Kubernetes API marks the object for deletion by populating .metadata.deletionTimestamp, and returns a 202 status code (HTTP "Accepted"). The target object remains in a terminating state while the control plane, or other components, take the actions defined by the finalizers. After these actions are complete, the controller removes the relevant finalizers from the target object. When the metadata.finalizers field is empty, Kubernetes considers the deletion complete and deletes the object.

Maybe it starts scaling down, then the pod enters the terminating state, and pod deletion is prevented by the finalizer as long as it says it's busy. Finally deletion times out and it's forcefully deleted? Is there a way to make the finalizer prevent actual termination?

genisd commented Feb 22, 2022

I built master and am testing/running this as we speak, with only webhooks (not metrics) and 2 (used) runner configurations.

genisd commented Feb 23, 2022

This issue kinda crept up; while I saw this happening before, it perhaps happened quicker/more often.
The runner pod enters the NotReady state in Kubernetes (the runner finished a job and exited with exit code 0). On GitHub a new job was already enqueued for that pod, but the pod never restarts (k8s doesn't restart it because it's configured not to) and the job never gets picked up. @tuomoa is this perhaps what you are experiencing as well?
#1144

So if I configure k8s to restart the pod with RestartPolicy: Always it works like a charm; I don't know what other implications that change could have.

@toast-gear toast-gear pinned this issue Feb 23, 2022

tuomoa commented Feb 24, 2022

I don't think our issue is related to that. According to my testing the issue appears this way:

  1. 10 jobs are queued, RunnerSet is scaled up to 10 replicas
  2. all the runners are running a job
  3. one of them (for example runner with replica index 0) finishes and completed event is received
  4. RunnerSet replicas is scaled to 9
  5. Because of how RunnerSet works (a StatefulSet with podManagementPolicy: OrderedReady), it is scaled up and down in an ordered manner
  6. Because the desired replicas go from 10 -> 9, the runner with replica index 9 (the 10th runner) is terminated (not deleted yet). But because it's terminated, the running job is killed.
  7. The ARC finalizer prevents pod deletion because the runner was busy when it was terminated. After 10 minutes (DefaultScaleDownDelay = 10 * time.Minute) ARC finally forcefully runs the unregister/delete finalizer and the pod is deleted.

I think before the fix which was merged to master, the finalizer didn't prevent pod deletion when the runner was offline and busy (to not get stuck?). So previously the same thing happened but faster, because the pod was terminated (due to the StatefulSet's desired replicas decreasing) and then it was also deleted faster.

I stumbled across this actions/runner#1658 (comment) and tried a similar workaround:

...
    spec:
      containers:
      - name: runner
        image: summerwind/actions-runner-dind
        lifecycle:
          preStop:
            exec:
              command:
                - "sh"
                - "-c"
                - "while [ $(ps -C Runner.Worker -o pid=) ]; do echo \"Still running\"; sleep 1; done && echo \"Runner is idling\""

This will prevent the StatefulSet from killing the process (when trying to terminate the pod) as long as the runner worker process running the job is alive. It is a kinda ugly hack but seems to work okay... It would be great if actions/runner#699 were implemented. Also, this requires increasing terminationGracePeriodSeconds to the duration of the "slowest possible job" which could be blocking the termination (so the job can finish before the pod is killed ungracefully). Also, this requires using RUNNER_FEATURE_FLAG_EPHEMERAL, because otherwise the runner might pick up a job just before it's killed. @genisd I think you should probably use this new flag (which passes --ephemeral to the GH runner instead of the older --once, which is known to be buggy in a way that it might pick up a new job before exiting).

This makes me think that scaling a RunnerSet (= using StatefulSet) is maybe not the best way to accomplish autoscaled runners, at least if you don't have fairly static runner usage, because you can't be sure which runners should be scaled down and it tries to scale down the last one, even though ARC could know which runners are idling. However, when ephemeral runners are used, it should eventually be able to scale down the "last replicas" correctly, because the last runners are scaled down and won't pick up new jobs while the "first runners" will pick up jobs in the queue, so I'm not sure how much this matters after all 🤔

We are specifically using RunnerSet because we want to persist volumes with some build-related files when runners are scaled down and then back up (volumeClaimTemplates). I'm not sure how podManagementPolicy: Parallel would affect the scale-down behavior; maybe it would just try to terminate all "extra replicas" at the same time instead of one by one, which would lead to a better reaction in a case where there are 5 extra runners: it would try to stop all "extra 5" at the same time, instead of stopping the last one and after that the next "last one", which leads to a situation where the "second last one" might run jobs while the first "non-extra" runners might be idling. I'm probably gonna try out the Parallel policy as well.

I think there is also the other unrelated issue of scale-down being triggered by other jobs running in the cloud completing, but that should be a separate issue.

mumoshu commented Feb 24, 2022

This makes me think that scaling a RunnerSet (= using StatefulSet) is maybe not the best way to accomplish autoscaled runners, at least if you don't have fairly static runner usage, because you can't be sure which runners should be scaled down and it tries to scale down the last one

@tuomoa This! Thank you so much for the great summary ☺️

I was even wondering if creating one StatefulSet per stateful runner (so a RunnerSet corresponds to one or more StatefulSets) would be better.

tuomoa commented Feb 24, 2022

Yeah, I think the best would be for the HRA to decide which runner pods should be scaled down instead of letting k8s decide 🤔 But maybe when using the non-buggy ephemeral mode and the Parallel pod policy this could still work pretty well (with the workaround for pod termination 😬 ). Does RunnerDeployment differ a lot? My guess is that it is pretty much the same, but k8s might terminate pods in an unordered manner (I'm not really sure how it decides which pods to stop when replicas are scaled down?), so StatefulSet + Parallel pod policy will work pretty much the same as RunnerDeployment? Well, a StatefulSet will still stop the "last replicas" while a Deployment might stop the newest or latest or whatever, but that still does not matter for the termination logic. Both will try to terminate some of the pods, which might be running a job or not?

mumoshu commented Feb 24, 2022

My guess is that it is pretty much the same

Yeah, pretty much the same once we've made one RunnerSet correspond to multiple StatefulSets. RunnerDeployment maintains one pod per job, while RunnerSet would maintain one single-replica StatefulSet per job.

Well, a StatefulSet will still stop the "last replicas" while a Deployment might stop the newest or latest or whatever, but that still does not matter for the termination logic.

Sorry if I misread it, but RunnerDeployment does not depend on a K8s Deployment, so we can implement whatever strategy we want there. RunnerDeployment currently "prefers" non-busy runners, so it has worked pretty well, although you had some races between ARC and GitHub when a job was about to be scheduled onto a runner pod being terminated.

That said, #1127 enhances it in two ways. First, it doesn't try to recreate ephemeral runners eagerly anymore so that there will be less chance of a race between ARC and GitHub.

Second, it defers the removal of a runner pod for 60s by default: if the runner pod was ephemeral and was able to stop due to a successful job run within that 60s, there is zero chance of a race condition.

tuomoa commented Feb 24, 2022

All right, okay. I haven't taken a closer look at RunnerDeployments yet, so I didn't really know how they worked; I was pretty much guessing. But yeah, that sounds like a better implementation 👍 Having multiple single-replica StatefulSets sounds a bit odd at first glance, but maybe it's still the best way to do it.

tuomoa commented Feb 24, 2022

With the preStop hack workaround it appears that jobs finish before runners are killed.

But for some reason we are still experiencing issues with scaling. We have something like 20+ jobs and 5 runners in the current scenario I'm testing. Everything starts great, but then after 20-25 minutes the calculated replicas suddenly start to drop from 5 to eventually 0, even though runners are running jobs and there are jobs in the queue as well.

1.6456919833347774e+09	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 25	{"horizontalrunnerautoscaler": "default/runnerset-autoscaler", "suggested": 0, "reserved": 0, "min": 0, "cached": 0, "last_scale_up_time": "2022-02-24 08:30:17 +0000 UTC", "scale_down_delay_until": 1645692017, "max": 50}
...
1.6456920800427752e+09	DEBUG	actions-runner-controller.horizontalrunnerautoscaler	Calculated desired replicas of 1	{"horizontalrunnerautoscaler": "default/runnerset-autoscaler", "suggested": 0, "reserved": 1, "min": 0, "cached": 0, "max": 50}

I've been trying to debug what's causing this but still don't have any leads. I tried ignoring the events without labels, which was my first guess, but even though they are ignored in my custom build and no longer trigger scale-down, this still happens, so I'm thinking there is something else going on. 🤔 Could there be something going on with the https://github.com/actions-runner-controller/actions-runner-controller/blob/6b12413fddbdff396d13391a64819b7e629f2ae4/controllers/horizontal_runner_autoscaler_webhook.go#L679 logic trying to release reserved capacity after some time has passed, triggering an unnecessary scale-down too early?

mumoshu commented Feb 24, 2022

Everything starts great, but then after 20-25 minutes the calculated replicas suddenly start to drop from 5 to eventually 0, even though runners are running jobs and there are jobs in the queue as well.

@tuomoa What value do you set in HRA.Spec.ScaleTriggers[].Duration? It's the duration until the replica added by a workflow job "completed" event "expires". So, if your job takes 20 or 25 minutes in standard scenarios and if the scale trigger duration is lower than that duration, you'll end up expiring runner pods added by status=queued jobs earlier than status-completed jobs.

tuomoa commented Feb 24, 2022

@tuomoa What value do you set in HRA.Spec.ScaleTriggers[].Duration? It's the duration until the replica added by a workflow job "completed" event "expires". So, if your job takes 20 or 25 minutes in standard scenarios and if the scale trigger duration is lower than that duration, you'll end up expiring runner pods added by status=queued jobs earlier than status-completed jobs.

Sounds like that might explain it. That duration is currently 5 minutes and there are jobs taking longer than that. So should it be more than the longest running job?

Of note is the HRA.spec.scaleUpTriggers[].duration attribute. This attribute is used to calculate if the replica number added via the trigger is expired or not. On each reconcilation loop, the controller sums up all the non-expiring replica numbers from previous scale up triggers. It then compares the summed desired replica number against the current replica number. If the summed desired replica number > the current number then it means the replica count needs to scale up.

Or does this mean that it should be more than a guess of "longest job time in queue" + "longest job time" (which might be hard to determine if there is a maxReplicas limit)? I'm just thinking that when there are, for example, 20 jobs, the first 5 trigger a scale-up, then those runners continue running the rest of the jobs; what happens after that duration has passed since the first scale-up for the first jobs?

If the point of that duration is to scale down when we are missing webhook events, could it be calculated from the pod startup time? At least when using ephemeral runners, it would mean that the runner has been idling or is still running a job.

Edit: Based on my testing it looks like the latter, which causes runners to be scaled down when there are still jobs in the queue, if the duration is not long enough (wait time + run time for all the jobs queued when the pipeline starts).

mumoshu commented Feb 24, 2022

So should it be more than the longest running job?

If you'd like ARC to not decrease the desired number of runners for 100% of cases, yes. Practically, it can be something that is longer than, say, 99% of the jobs, or 50%, depending on your requirement.

Or does this mean that it should be more than a guess of "longest job time in queue" + "longest job time"

Hmm, maybe. I can't say for sure quickly.

Anyway, the way ARC works with the scale trigger duration is that every "capacity reservation" added to the HRA by each scale trigger (on scale-up) is included in the desired number of runners until it expires. On scale-down, ARC removes the oldest "capacity reservation" saved in the HRA spec, so that the desired number decreases before the expiration.

So probably it's more like the longest "job time + queue time". If you enqueue 100 jobs at once, that results in 100 capacity reservations expiring at the same time in the future. Let's say you have only 10 maxReplicas at the moment; the 11th job will start running only after one of the first 10 jobs completes. The lower the scale trigger duration, the more likely a subset of the 100 capacity reservations may expire before the final 100th job completes.

tuomoa commented Feb 24, 2022

So probably it's more like the longest "job time + queue time". If you enqueue 100 jobs at once, that results in 100 capacity reservations expiring at the same time in the future. Let's say you have only 10 maxReplicas at the moment; the 11th job will start running only after one of the first 10 jobs completes. The lower the scale trigger duration, the more likely a subset of the 100 capacity reservations may expire before the final 100th job completes.

Yeah that sounds kinda hard if there is no constant flow of ~similarly time taking jobs and if max replicas limits the number of possible runners. I'm just thinking why don't we just increase some runners needed counter on queued event, decrease on completed event and when updating desired replicas cap the counter to maxReplicas. And then to avoid runners idling forever decrease desired replica count by the amount of runners which have been up for a configurable amount of time (max job running time). In this case if we miss all the completed events the (ephemeral!) runners would be scaled down after they have been idling for example 30 minutes or so. And if they picked up jobs from the queue, the 30 minutes starts again because pod is restarted (metadata creationTime changes). 🤔 This would be quite simple solution and it would avoid the problem with jobs queueing for different amount of times. I may have missed some details on my reasoning, not really that deep into this project/runners 😄 but what do you think?

For now probably the best RunnerSet configuration for our scenario (jobs running/in queue changing from 0-100 or so, not constant flow of jobs etc.) is something like this:

spec:
  ephemeral: true
  podManagementPolicy: Parallel
  ...
  template:
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: runner
        image: summerwind/actions-runner-dind
        ...
        env:
          - name: RUNNER_FEATURE_FLAG_EPHEMERAL
            value: "true"
        lifecycle:
          preStop:
            exec:
              command:
                - "sh"
                - "-c"
                - "while [ $(ps -C Runner.Worker -o pid=) ]; do echo \"Still running\"; sleep 1; done && echo \"Runner is idling\""

and HRA:

spec:
  scaleTargetRef:
    kind: RunnerSet
    ...
  minReplicas: 0
  maxReplicas: 15
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - ourrepos...

It will be a bit slower to react, but it will always guarantee that queued jobs will be run, and it will (after a while) scale down when runners are idling. It also guarantees that the runner process is not terminated if it was running a job.

  • The preStop hook prevents terminating the runner if it picked up a job before the scale-down was attempted.
  • A longer terminationGracePeriodSeconds (more than the longest job) gives the preStop hook time to finish if the runner was running a job.
  • Parallel mode will allow creating multiple pods / terminating multiple pods at the same time instead of one-by-one (which takes more time).
  • RUNNER_FEATURE_FLAG_EPHEMERAL is important because without it the runner uses the older --once, which is buggy in a way that the runner might pick up a new job before stopping.

mumoshu commented Feb 25, 2022

I'm just thinking why don't we just increase some runners needed counter on queued event, decrease on completed event and when updating desired replicas cap the counter to maxReplicas. And then to avoid runners idling forever decrease desired replica count by the amount of runners which have been up for a configurable amount of time (max job running time). In this case if we miss all the completed events the (ephemeral!) runners would be scaled down after they have been idling for example 30 minutes or so. And if they picked up jobs from the queue, the 30 minutes starts again because pod is restarted (metadata creationTime changes). 🤔 This would be quite simple solution and it would avoid the problem with jobs queueing for different amount of times.

@tuomoa Yeah this would definitely be great! The only problem is that I don't know how it can be implemented cleanly.

In ARC's current model, RunnerDeployment doesn't even know when a replica is added (and when it expires). All it knows is the number of desired replicas. Runner doesn't even know how many siblings it has (runners that participate in the same "desired replicas" defined by the upstream RunnerDeployment).

Autoscaling is an optional feature, so Runner and RunnerDeployment are best kept with as few features as possible that make sense only within the context of autoscaling. Something that can be expressed mostly by HRA would be nice.

There might be at least five cases to be handled:

  1. Received workflow_job event with status=queued, then timed out (never received the corresponding completed event)
  2. Received workflow_job event with status=queued, then status=in_progress, and finally status=completed
  3. Received workflow_job event with status=queued, then status=in_progress, and finally status=completed, conclusion=skipped
  4. Received workflow_job event with status=queued, then status=in_progress, and timed out (no completed event)
  5. Received workflow_job event with status=completed without the preceding one with status=queued

ARC's webhook autoscaler currently works by matching a workflow_job event to a corresponding HRA and RunnerDeployment by runner groups (if enabled) and labels; it adds a capacity reservation (of 1 replica) when the event was status=queued and removes the oldest capacity reservation of the same replica number when it was status=completed. It does nothing when it was status=in_progress.

Perhaps there is a chance if we somehow set the expiration date for a capacity reservation only after we receive an event of status=in_progress? Then a capacity reservation's (and hence the added replica's) expiration timer starts only after the job starts running. WDYT?
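
Conceptually, such a reservation could look like the following Go sketch; this is purely illustrative, and the type and field names are hypothetical rather than ARC's actual capacity reservation API:

package main

import "time"

// capacityReservation is a hypothetical model of the idea above: one
// reservation per "queued" workflow_job event, whose expiration timer only
// starts once the corresponding "in_progress" event arrives. Not ARC's
// actual CapacityReservation type.
type capacityReservation struct {
	queuedAt     time.Time
	inProgressAt *time.Time // nil until an in_progress event is seen
}

// expired reports whether the reservation should stop counting toward the
// desired replica number.
func (c capacityReservation) expired(now time.Time, jobTimeout, queueTimeout time.Duration) bool {
	if c.inProgressAt != nil {
		// The job started: expire one job-timeout after it began running.
		return now.After(c.inProgressAt.Add(jobTimeout))
	}
	// The job never started (e.g. a lost webhook): fall back to a much
	// longer queue timeout.
	return now.After(c.queuedAt.Add(queueTimeout))
}

func main() {
	started := time.Now().Add(-30 * time.Minute)
	r := capacityReservation{queuedAt: time.Now().Add(-2 * time.Hour), inProgressAt: &started}
	_ = r.expired(time.Now(), time.Hour, 6*time.Hour) // false: the job started 30m ago with a 1h timeout
}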

tuomoa commented Feb 25, 2022

In ARC's current model, RunnerDeployment doesn't even know when a replica is added (and when it expires). All it knows is the number of desired replicas. Runner doesn't even know how many siblings it has (runners that participate in the same "desired replicas" defined by the upstream RunnerDeployment).

Yeah, I suppose the current design does not really support that idea then. I'm not sure how big of a change that other kind of scaling would require. Maybe the HRA could look up the scale target's information, see how many pods are currently running and what the desired replicas are there, and check whether there are "expired runner pods among the replicas". At least the HRA controller already fetches the RunnerSet podList. Maybe the webhook scaler could use that as well instead of the capacity reservation logic. But yeah, I'm not that familiar with the code, so I don't really know.

Perhaps there is a chance if we somehow set the expiration date for a capacity reservation only after we receive an event of status=in_progress? Then a capacity reservation's (and hence the added replica's) expiration timer starts only after the job starts running. WDYT?

This might work as well, although I'm not sure how we could handle a possibly missing status=in_progress any better than in the current scenario (case 1 in your examples); well, maybe there could be a long enough timeout (hours?) for that. I suppose missing events do not happen so often anyway, so maybe it could be good enough 🤔

mumoshu commented Feb 25, 2022

maybe there could be a long enough timeout (hours?)

Yeah! That's my idea. Have a very long timeout like, say, 6h for missing status=in_progress, and 1h for the job timeout.

mumoshu commented Mar 2, 2022

@genisd As I've replied to you in #1144, #1127 and #1167 would make your workaround of setting RestartPolicy=Always unnecessary

@jbkc85 As part of #920, I ended up touching various places in the ARC codebase intended to alleviate those job timeout/cancellation issues. After #1127 and #1167, you should have a much lower chance of encountering them.

@tuomoa In #1167, I've enhanced it so that every runnerset-managed runner pod will never exit before it gets a runner ID assigned by GitHub. This should prevent any runner pod from racing between the registration and job assignment/running process and ARC terminating the pod. So regardless of how long the scale trigger duration is, ARC won't terminate already created ephemeral runner pods prematurely while they're running any jobs.

Also in #1167, I've added a new field EffectiveTime to the RunnerSet spec as well (I've added the same field to RunnerDeployment in #1127, but just wanted to mention RunnerSet because you seem to be using it), which is used as an indicator to trigger the creation of missing runner pods. It adds a set of missing runner pods once, after the specified EffectiveTime. The intention of the new field was to prevent ARC from unnecessarily recreating runner pods that are then immediately terminated due to scale-down by workflow_job events with completed statuses. But it might be a step forward (although it isn't exactly what you expected) towards your idea below:

why don't we just increase some runners needed counter on queued event, decrease on completed event and when updating desired replicas

Maybe we'd add another facility to ARC that handles in_progress workflow job events to start the scale trigger duration timer instead.

The end result may be that ARC "looks more like":

  • Adding one replica per "queued" workflow job event
  • that expires only after the scale trigger duration passed since it got a corresponding "in_progress" workflow job event,
  • that terminates before the expiration when it received "completed" workflow job event

It's just "more like" because we still have no way to stick a runner to a specific job. But it would be better than what we have today 😄

jbkc85 commented Mar 2, 2022

@mumoshu the changes look great! I will have to give them a try in a while.

I did start looking into a different method of checking for proper runners though, sort of a 'cross-check' in a manner of speaking. Basically, the workflow API (client.Actions.ListRepositoryWorkflowRuns in the go-github package) gives us insight into a job, and that job actually reports which runner it is currently running on. During the deletion period you could reference all jobs running in a given repository, and from that you have another API to cross-reference 'busy runners'. This might be cool to tie into the existing runner checks, as it gives us a second look over the running workflows/jobs.

Example:

2022/03/02 14:59:48 [DEBUG] Total Count: 10273 for PR builder (in_progress)
2022/03/02 14:59:50 Job: testa Test on runners-cwtsk-tw6b5
2022/03/02 14:59:50 Job: testb Test on runners-cwtsk-r9nwk
2022/03/02 14:59:50 Job: testc Test on runners-cwtsk-8687d
-----workflow stats for xxxxxxxxxx----- Jobs Completed: 26 | Jobs Queued: 0 | Jobs In Progress: 3 | Jobs Unknown: 0
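
A minimal go-github sketch of that cross-check might look like the following; the owner/repo, the GITHUB_TOKEN environment variable, and the go-github major version are placeholders, and job.GetRunnerName() assumes a go-github version recent enough to expose the workflow job's runner_name field:

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v39/github"
	"golang.org/x/oauth2"
)

func main() {
	// List in-progress workflow runs for a repository and print which runner
	// each of their jobs is currently assigned to; any runner named here
	// should be treated as busy.
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	runs, _, err := client.Actions.ListRepositoryWorkflowRuns(ctx, "your-org", "your-repo",
		&github.ListWorkflowRunsOptions{Status: "in_progress"})
	if err != nil {
		panic(err)
	}
	for _, run := range runs.WorkflowRuns {
		jobs, _, err := client.Actions.ListWorkflowJobs(ctx, "your-org", "your-repo", run.GetID(), nil)
		if err != nil {
			panic(err)
		}
		for _, job := range jobs.Jobs {
			fmt.Printf("Job: %s (%s) on %s\n", job.GetName(), job.GetStatus(), job.GetRunnerName())
		}
	}
}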

mumoshu commented Mar 7, 2022

@jbkc85 Thank you for sharing your insight! I'd definitely reconsider what we can achieve with the ListWorkflowRuns API now. We only used it for the TotalNumberOfQueuedAndInProgressWorkflowRuns-based scale, to limit the number of API calls, but since #1127 it should be cheap enough.

mumoshu commented Mar 7, 2022

In #1180 I've added a hard-coded 10-minute timeout after EffectiveTime. Let's see how it works. See the comment for more details on what problem #1180 is supposed to fix.

I'm still scratching my head around how EffectiveTime and #1180 should work with the potential enhancement on the scale trigger to make it expirable though.
I'd love your feedback on that too!

toast-gear commented

We're going to close this, as it is fairly long-running, various fixes have been added, and it's now hard to track how fixed this is. Please re-raise on the latest code if there is more work to be done on this issue.

@toast-gear toast-gear unpinned this issue Apr 6, 2022
@toast-gear toast-gear self-assigned this Feb 20, 2023