Random Operation Cancelled Runner Decommissions #911
Comments
@jbkc85 Hey! Have you already tried the new ephemeral runners that can be enabled via RUNNER_FEATURE_FLAG_EPHEMERAL? Also, which flavor of GitHub are you using: GitHub Enterprise Server, or GitHub Cloud?
Normal GitHub (that's cloud? Definitely not Enterprise), mixed in with runners deployed on AWS EKS. I have tried ephemeral, to no luck. Generally what happens is that during a scale-up event, when the reconciler starts to run through its scaling down, it picks new pods that are not yet registered (because they haven't had a chance to register) and marks them for deletion. I might try ephemeral again to see what we can do, as maybe we just didn't catch the issue earlier. I will get back to you!
@jbkc85 We are also experiencing the same issue. Have you found a solution to this? I am currently using a stateful RunnerSet, and the webhook terminates the wrong pod during a scale-down webhook event. Was this working on versions before 0.20.1?
I am not sure it was ever 'working', depending on your workload. The fact is that the reconciler reconciles runners that have not yet registered, and when you are spinning up (and down) a ton of nodes, registration falls behind due to API limits or otherwise, so some of your runners eventually get reconciled prematurely. I am trying to pinpoint the exact cause of this in the code, but I haven't had as much luck as I wished. One way to eliminate SOME of these issues is to increase your registration check timeout/interval.
@jbkc85 Thanks for the info!
I've read the code, but AFAIK this shouldn't happen, as we use the runner's CreationTimestamp for the registration check timeout, and it's reset by deleting/recreating the runner on scale-down/up.
I think this one is hard-coded today. Would you mind changing it and building a custom actions-runner-controller image using our makefile, to see if making it longer works for you?
I was thinking that we already guard against that case in the code. In a nutshell, we log it when it happens.
What was your runner image tag?
@antodoms Hey! Thanks for reporting. I think yours is a completely different issue. Would you mind creating another issue for it, so that we can better differentiate it and avoid other folks bringing similar but different issue reports here, which gets confusing? Anyway, I'm suspecting you're being affected by a potential bug in the RunnerSet implementation. It's implemented separately from RunnerDeployment. RunnerDeployment has some logic to prevent busy runners from being terminated; I might have missed implementing similar logic for RunnerSet.
@antodoms Hey! Have you used RUNNER_FEATURE_FLAG_EPHEMERAL? I reviewed both actions-runner-controller's and actions/runner's (C#) code, and I have come to think that the use of ephemeral runners would solve your issue for RunnerSet.
@jbkc85 Hey! Which runner image did you use, btw? I still think the use of new ephemeral runners would resolve your issue, so I was wondering if you used some outdated runner image that doesn't yet support RUNNER_FEATURE_FLAG_EPHEMERAL. If you could share the exact RunnerDeployment manifest you used to test RUNNER_FEATURE_FLAG_EPHEMERAL, I'd greatly appreciate it! Thanks.
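For reference, a minimal sketch of the kind of RunnerDeployment manifest being discussed here, assuming a recent actions/runner image that understands the flag; the name and repository below are placeholders, not the manifest jbkc85 actually used:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy              # placeholder name
spec:
  replicas: 2
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      ephemeral: true                         # each runner unregisters after one job
      env:
        # Opts in to the newer --ephemeral registration mode of actions/runner.
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"
```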
@mumoshu Basically, I want to share the Docker cache (images) between multiple GitHub runner pods, for which I thought RunnerSet was best. But now that I know RunnerSet uses StatefulSet, I think it will create a new volume for each pod instead of sharing the volume between pods, so I am currently changing my runners to use RunnerDeployment with a PersistentVolumeClaim using EFS. The only thing I am not able to do so far is reuse the existing images on the next pod; I'm not sure if I am targeting the volume correctly. This is my current config.
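As a rough illustration of this kind of setup (not the exact manifest referenced above), a RunnerDeployment mounting a pre-existing, EFS-backed PersistentVolumeClaim into the runner pods might look like the sketch below; the claim name, mount path, and repository are assumptions:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: cache-runnerdeploy                 # placeholder name
spec:
  replicas: 2
  template:
    spec:
      repository: example-org/example-repo    # placeholder repository
      volumeMounts:
        # Where the cache should be mounted is an assumption; adjust it to
        # wherever your images/layers actually live in your setup.
        - name: image-cache
          mountPath: /var/lib/docker
      volumes:
        - name: image-cache
          persistentVolumeClaim:
            claimName: runner-image-cache      # assumed EFS-backed PVC
```

Whether a volume like this can actually be shared safely between pods is a separate question, addressed in the next reply.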
@antodoms Thanks! But, AFAIK, a volume like that can't simply be shared between runner pods.
So, all you could do would be to have a single persistent volume per pod, and ensure each pod reuses its own volume.
@mumoshu I am so sorry, I've been out sick for the last... well, pretty much a week. I am going to try out a few of these things this week and see what happens. I do know that we used the ephemeral flag at one point, but I believe we saw the same issues. I can try it again this week when I get back into action and see what happens! Thank you again for all the responses, and I apologize for leaving things high and dry with my absence.
@mumoshu Now I remember. We had ephemeral on, but due to the constant check-ins we actually ran ourselves out of GitHub API calls constantly. We are talking about scaling from 1 to 130 nodes relatively quickly... so yeah, the ephemeral runners hurt us there. I will still give this a shot whenever I get a chance. I do appreciate your communication and effort on this project!
@jbkc85 Hey! Thanks for the response. Yeah, you mentioned that it might be due to the API rate limit. But looking at the runner's and the controller's implementation/code, I'd say the API rate limit shouldn't affect runner availability. It would be great if you could provide more complete evidence of the API rate limit affecting runner availability, or easy steps to reproduce your issue on anyone's environment, if possible, so that we can leverage our collective resources to debug it further together! Anyway, the API rate limit issue will mostly disappear once we manage to land #920. Thanks for suggesting that, too!
Hello, I am experiencing something quite similar to what @jbkc85 described.
Please let me know if you have any suggestions for things we could try, or other info you need to better understand it. Thank you!
@gmambro Hey! Your log seems quite normal to me. Did you actually see the issue happen there?
The problem is that when the controller says the runner is offline and removes it, the pod is actually still busy running a job.
@gmambro Hey! Thanks for confirming.
I reread this, and all I can say is that this can happen when GitHub somehow failed to update the runner status. Here's the actions-runner-controller code that detects this condition. So, this literally happens only when the runner is actually offline (as far as the controller can see from the response of the GitHub API call). Maybe this is caused by GitHub, rather than by actions-runner-controller?
@gmambro Another thing I wanted to confirm is: are you sure there's no suspicious event on the runner pod? I guess this can also happen when it's running on an EC2 spot instance, a GCP preemptible VM, or the like, which forcefully terminates the runner pod when the VM gets automatically stopped.
This is what I was afraid of, so I wondered if there was anything we could check in the logs before calling it a GH issue. At first I thought of the rate limit, but after looking at this discussion I found that is not the case, since it would be logged in a different way.
@gmambro Thanks! Yeah, as you've discovered, the controller should print a specific log message when it hits the GitHub API rate limit. A dirty way to debug this further before filing an issue against GitHub would be this: you can try repeatedly polling the GitHub API that lists your self-hosted runners and watch how the problematic runner's status changes over time.
Quick update. There is no external autoscaling thing that kills pods or nodes. Indeed, the problem seems to be in the API: the runner which gets killed is reported as offline. However, there is an interesting thing: the runner is offline but busy at the same time. This can probably help in detecting these anomalies (e.g. don't delete offline runners if they are busy).
We have used fail-fast, but in the situations where we see a random cancellation it's not this particular issue.
What @gmambro said. Every time I have looked into it, I have seen the problem of a runner being 'busy' but still being marked for deletion. I am trying to find more logs from the time it happened, but it's hard to catch in the act given its randomness.
@jbkc85 To me it looks like the GH API gives an inconsistent status for the runner which will be killed, as I used a small client to monitor it and I found that it was marked as offline but busy.
@gmambro Wow, this is great news! If that's the only edge case I need to be aware of, it won't be that hard to implement a workaround on the actions-runner-controller side. It can be as easy as updating the deletion check so that offline-but-busy runners are not treated as deletable.
@gmambro I'd appreciate it if you could share your additional insight on this. I'm now wondering if it ever becomes online again after that. The current actions-runner-controller design assumes that a runner's life follows a certain sequence of states.
How does it look for your problematic runner? Is it like:
or:
Honestly speaking, I'm not fully sure how I'd reflect that information in the controller's design yet.
At some point, they switch from online/busy to offline/busy and then a few minutes later the controller kills them. Let me run some more tests and I will come back.
As of today I wasn't able to repro this issue, so I still don't have as many experiments to share with you as I'd wish.
I wonder if the runners can become temporarily offline/busy because of a job consuming all of the runner's resources. If this behaviour is confirmed, it would be nice to have something like a grace period for offline/busy runners, to give them a chance to go back online.
So, I got some more repros; the story looks like this:
@mumoshu I gave it a go and it seems to work a little bit better, but still something odd happens. Out of 8 scaled-up runners, 2 failed in a similar manner.
But the runner pod is still actually terminated (while it was running a job). I think it was terminated before it should have been.
Also, I observed some other weird behaviour. For some reason the calculated desired replicas suddenly drops from 15 to 1 while the pipeline is running and there are still queued jobs:
I tried to track this down by looking at the webhook log, and it shows that a scale-down is triggered when it receives a completed workflow_job event, apparently including jobs that did not run on these runners.
I'm not sure why this happens, but could it be that https://github.com/actions-runner-controller/actions-runner-controller/blob/0b9bef2c086a54b0731fa7aa0a6bd7cb856b48ee/controllers/horizontal_runner_autoscaler_webhook.go#L695 doesn't really consider the case where the labels in the event are empty (some jobs run in the cloud, and some jobs run on our self-hosted runners)?
EDIT: I added some debug logging to my custom-built image. I'm still not sure what happens, but I'm now wondering: is there a problem when https://github.com/actions-runner-controller/actions-runner-controller/blob/7156ce040ec65dd80863f390f8397dbf01c8466d/controllers/runnerset_controller.go#L216 patches the new desired replicas (maybe because it is calculated wrong, due to the issue above?) and it doesn't correctly take into account that runners are busy, so it starts to terminate pods, which then get deleted later on? Or should the runner-pod finalizer make sure that the pod is not terminated while it is busy, or is it currently preventing deletion instead of termination?
Maybe it starts scaling down, then the pod enters the terminating state, and then pod deletion is prevented by the finalizer as long as it says it's busy; finally the deletion times out and the pod is forcefully deleted? Is there a way to make the finalizer prevent the actual termination?
I built master and am testing/running it as we speak, with only webhooks (not metrics) and 2 (used) runner configurations.
This issue kinda crept up; while I saw this happening before, it perhaps happened quicker/more often. So I configured k8s to restart the pod with restartPolicy: Always as a workaround.
I don't think our issue is related to that. According to my testing the issue appears this way:
I think before the fix which was merged to master, the finalizer didn't prevent pod deletion when the runner was offline and busy (so as not to get stuck?). So before, the same thing happened but faster, because the pod was terminated sooner. I stumbled across actions/runner#1658 (comment) and tried a similar workaround:
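For context, workarounds in the spirit of actions/runner#1658 usually combine a long termination grace period with a preStop hook that blocks while a job is still in flight. A minimal sketch, assuming pgrep is available in the runner image; the timings and the process-name check are illustrative, not necessarily the exact workaround used here:

```yaml
# Fragment of a RunnerSet/StatefulSet pod template; values are illustrative.
spec:
  template:
    spec:
      # Give an in-flight job plenty of time to finish before the pod is killed.
      terminationGracePeriodSeconds: 3600
      containers:
        - name: runner
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  # Runner.Worker only exists while a job is executing, so this
                  # blocks shutdown until the current job has finished.
                  - "while pgrep -x Runner.Worker > /dev/null; do sleep 1; done"
```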
This will prevent the runner from being stopped in the middle of a job. This makes me think that it is the scaling down of the StatefulSet that ends up terminating busy runner pods, because Kubernetes picks which pods to remove without knowing whether they are busy, and we are using RunnerSet specifically. I think there is also the other, unrelated issue that causes a scale-down to be triggered when other jobs running in the cloud complete, but that should probably be a separate issue.
@tuomoa This! Thank you so much for the great summary. I was even wondering whether creating one StatefulSet per stateful runner (so a RunnerSet corresponds to one or more StatefulSets) might be better.
Yeah, I think the best would be for the HRA to decide which runner pods should be scaled down, instead of letting Kubernetes decide 🤔 But maybe when using the non-buggy ephemeral mode and the Parallel pod management policy this could still work pretty well (with the workaround for pod termination 😬), as sketched below. Does RunnerDeployment differ a lot? My guess is that it is pretty much the same, but Kubernetes might terminate pods in an unordered manner (I'm not really sure how it decides which pods to stop when the replica count is scaled down), so a StatefulSet with the Parallel pod policy would work pretty much like a RunnerDeployment? Well, the StatefulSet will still stop the "last" replicas while a Deployment might stop the newest or oldest or whatever, but that does not matter for the termination logic. Both will try to terminate some of the pods, which might or might not be running a job.
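For what it's worth, a sketch of what such a RunnerSet might look like, assuming RunnerSet passes StatefulSet fields such as podManagementPolicy straight through to the StatefulSet it wraps; names are placeholders:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-runnerset                # placeholder name
spec:
  repository: example-org/example-repo   # placeholder repository
  ephemeral: true
  # Assumed to be forwarded to the underlying StatefulSet, so pods start and
  # stop in parallel instead of in strict ordinal order.
  podManagementPolicy: Parallel
  # Mandatory StatefulSet fields.
  serviceName: example-runnerset
  selector:
    matchLabels:
      app: example-runnerset
  template:
    metadata:
      labels:
        app: example-runnerset
```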
Yeah, pretty much the same once we've made one RunnerSet correspond to multiple StatefulSets: RunnerDeployment maintains one pod per job, while RunnerSet maintains one single-replica StatefulSet per job.
Sorry if I misread it, but RunnerDeployment does not depend on a K8s Deployment, so we can implement whatever strategy we want there. RunnerDeployment currently "prefers" non-busy runners, so it has really worked pretty well, although you had some races between ARC and GitHub when a job was about to be scheduled onto a runner pod being terminated. That said, #1127 enhances it in two ways. First, it doesn't try to recreate ephemeral runners eagerly anymore, so there will be less chance of a race between ARC and GitHub. Second, it defers the removal of a runner pod for 60s by default: if the runner pod was ephemeral and was able to stop due to a successful job run within that 60s, there is zero chance of a race condition.
All right, okay. I hadn't taken a closer look at RunnerDeployments yet, so I didn't really know how they worked; I was pretty much guessing. But yeah, that sounds like a better implementation 👍 Having multiple single-replica StatefulSets sounds a bit odd at first glance, but maybe it's still the best way to do it.
With the workaround in place things look a bit better, but for some reason we are still experiencing issues with scaling. We have something like 20+ jobs and 5 runners in the current scenario I'm testing. Everything starts great, but then after 20-25 minutes the calculated replicas suddenly start to drop from 5 to eventually 0, even though runners are running jobs and there are jobs in the queue as well.
I've been trying to debug what's causing this, but I still don't have any leads. Ignoring the events without labels was my first guess, but even though they are ignored in my custom build and no longer trigger a scale-down, this still happens, so I'm thinking there is something else going on 🤔 Could there be something going on with the https://github.com/actions-runner-controller/actions-runner-controller/blob/6b12413fddbdff396d13391a64819b7e629f2ae4/controllers/horizontal_runner_autoscaler_webhook.go#L679 logic that releases reserved capacity after some time has passed, triggering an unnecessary scale-down too early?
@tuomoa What value do you set for the scale trigger duration?
Sounds like that might explain it. That duration is currently 5 minutes, and there are jobs taking longer than that. So should it be more than the longest-running job?
Or does this mean that it should be more than a guess of "longest time a job spends in the queue" + "longest job time" (which might be hard to determine if there is a lot of variation)? If the point of that duration is to scale down when we are missing webhook events, could it be calculated from the pod startup time instead? At least when using ephemeral runners that would mean the runner has either been idling or is still running a job.
Edit: Based on my testing it looks like the latter, which causes runners to be scaled down while there are still jobs in the queue, if the duration is not long enough (wait time + running time for all the queued jobs when the pipeline starts).
If you'd like ARC to never decrease the desired number of runners prematurely, in 100% of cases, yes. Practically, it can be something that is longer than, say, 99% of your jobs, or 50%, depending on your requirements.
Hmm, maybe. I can't say for sure quickly. Anyway, the way ARC handles the scale trigger duration is that every "capacity reservation" added to the HRA by a scale trigger (on scale-up) is included in the desired number of runners until it expires. On scale-down, ARC removes the oldest "capacity reservation" saved in the HRA spec, so that the desired number decreases before the expiration.
So probably it's more like the longest "job time + queue time". If you enqueue 100 jobs at once, that results in 100 capacity reservations which expire at the same time in the future. Let's say you have only 10 maxReplicas at the moment; the 11th job will start running only after one of the first 10 jobs completes. The lower the scale trigger duration, the more likely it is that a subset of the 100 capacity reservations expires before the final, 100th job completes.
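To make the trade-off concrete, here is a sketch of a webhook-driven HorizontalRunnerAutoscaler where the trigger duration is deliberately chosen to outlast the expected "queue time + job time"; the names and numbers are placeholders:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler               # placeholder name
spec:
  scaleTargetRef:
    kind: RunnerSet                      # or RunnerDeployment
    name: example-runnerset
  minReplicas: 0
  maxReplicas: 10
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      amount: 1
      # Each webhook-triggered capacity reservation counts toward the desired
      # replicas until it expires, so pick something longer than the slowest
      # job's queue time plus run time.
      duration: "60m"
```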
Yeah, that sounds kind of hard if there is no constant flow of jobs taking roughly similar times, and if maxReplicas limits the number of possible runners. I'm just thinking: why couldn't we simply increase a "runners needed" counter when a job is queued and decrease it when a job finishes, instead of relying on timed reservations? For now, probably the best option for us is the pod termination workaround combined with an HRA whose scale trigger duration is long enough.
It will be a bit slower to react, but it will always guarantee that queued jobs get run, and it will (after a while) scale down when runners are idling. It also guarantees that the runner process is not terminated if it was running a job.
@tuomoa Yeah, this would definitely be great! The only problem is that I don't know how it can be implemented cleanly. In ARC's current model, RunnerDeployment doesn't even know when a replica is added (and when it expires). All it knows is the number of desired replicas. A Runner doesn't even know how many siblings it has (that participate in the same "desired replicas" defined by the upstream RunnerDeployment). Autoscaling is an optional feature, so Runner and RunnerDeployment are best kept free of features that only make sense within the context of autoscaling; something that can be expressed mostly by the HRA would be nice. There might be at least five cases to be handled:
ARC's webhook autoscaler currently works by matching an incoming workflow_job event against a scale target. Perhaps there would be a chance if we somehow set the expiration date for a capacity reservation only after we receive an event saying the job has actually started (status=in_progress).
Yeah, I suppose the current design does not really support that idea then. I'm not sure how big a change that other kind of scaling would require. Maybe the HRA could look up the scale target's information, see how many pods are currently running and what the desired replica count is over there, and check whether there are "expired runner pods" among the replicas. At least the HRA controller already fetches the RunnerSet pod list. Maybe the webhook scaler could use that as well, instead of the capacity reservation logic. But yeah, I'm not that familiar with the code, so I don't really know.
This might work as well, although I'm not sure how we could handle possibly missing webhook events.
Yeah! That's my idea: have a very long timeout, say 6h, for a missing status=in_progress, and 1h for the job timeout.
@genisd As I've replied to you in #1144, #1127 and #1167 would make your workaround of setting RestartPolicy=Always unnecessary. @jbkc85 As part of #920, I ended up touching various places in the ARC codebase, which is intended to alleviate those job timeout/cancellation issues. After #1127 and #1167, you should have much fewer chances of encountering it. @tuomoa In #1167, I've enhanced it so that every runnerset-managed runner pod will never exit before it gets a runner ID assigned by GitHub. This should prevent any runner pod from racing between the registration and job assignment/running process and ARC terminating the pod. So regardless of how long the scale trigger duration is, ARC won't terminate already-created ephemeral runner pods prematurely while they're running any jobs. Also in #1167, I've added a new field for this.
Maybe we'd add another facility to ARC that handles that case, too. The end result may be that ARC "looks more like" it is scheduling one ephemeral runner per workflow job.
It's just "more like" because we still have no way to stick a runner to a specific job. But it would be better than what we have today 😄
@mumoshu The changes look great! I will have to give them a try in a while. I did start looking into a different method of checking for proper runners though, sort of a 'cross check', in a manner of speaking. Basically the workflow API (client.Actions.ListRepositoryWorkflowRuns in the go-github package) provides us insight into a job, and that job actually records which runner it is currently running on. During the deletion period you could reference all jobs running in a given repository, and from that you have another API-based way to identify 'busy runners'. This might be cool to tie into the existing runner checks, as it gives us a second look over the running workflows/jobs. Example:
In #1180 I've added a hard-coded 10-minute timeout after EffectiveTime. Let's see how it works. See the comment there for more details on what problem #1180 is supposed to fix. I'm still scratching my head over how EffectiveTime and #1180 should work with the potential enhancement to make the scale trigger expirable, though.
We're going to close this, as it is fairly long-running, various fixes have been added, and it's now hard to track how much of this is fixed. Re-raise it against the latest code if there is more work to be done on this issue.
Describe the bug
In the current deployment of actions-runner-controller, whenever a GitHub runner has an action cancelled (via the GitHub UI) or has an action that fails (like a unit test), it seems to cause a cascading effect that triggers other random cancellations of actions running on other runners in the system. Furthermore, there is a high chance that at the end of a GitHub workflow one of the last remaining runners (i.e. a runner that survives until a scale-down is triggered via reconciling) is cancelled while 'busy' with a workflow job/step.
Checks
summerwind/actions-runner-controller:v0.20.1
and use the provided GitHub runner image with our own custom OpenJDK installation
To Reproduce
Steps to reproduce the behavior (though it is more random than not):
Expected behavior
The expectation is that the jobs continue regardless of success/failure, and the workflow succeeds in finishing the appointed jobs/tasks.
Screenshots
No screenshots, but the lines simply output:
Error: The Operation was canceled
Environment (please complete the following information):
Additional context
The workflow is extremely active and can have anywhere between 32-64 running jobs. Each job takes between 6 and 20 minutes, depending on the test it is running. What I have seen is that the logs for 'unregistered' runners seem to have an impact once those runners are reconciled. I see a lot of:
and based on the behavior, I wonder if what's happening is that the reconciler takes the registration timeout of a previous pod and applies it to these as they are trying to catch up, thus resulting in a runner being marked for deletion even though it is only a minute or two into its registration check interval.
With that being said, I was hoping to increase these intervals and have more of a grace period before the controller just shuts things down when it doesn't 'know' whether the runner is busy or not, but honestly I can't tell if that's what is happening.
Any insight would be helpful, and let me know if there is anything else I can provide!