Pod stuck in NotReady state - RestartPolicy OnFailure? #1144
Comments
This patch works wonderfully here in my GKE environment.

```diff
index 1bfa8bc..41f7180 100644
--- a/controllers/runner_controller.go
+++ b/controllers/runner_controller.go
@@ -922,7 +922,7 @@ func newRunnerPod(template corev1.Pod, runnerSpec v1alpha1.RunnerConfig, default
 	pod := template.DeepCopy()
 
 	if pod.Spec.RestartPolicy == "" {
-		pod.Spec.RestartPolicy = "OnFailure"
+		pod.Spec.RestartPolicy = "Always"
 	}
 
 	if mtu := runnerSpec.DockerMTU; mtu != nil && dockerdInRunner {
```
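As a side note: because the guard in that hunk only assigns a default when the template's RestartPolicy is empty, an explicit value in the pod template always wins. Below is a minimal, self-contained Go sketch of that defaulting behaviour (it is not ARC's actual function, just an illustration of what the diff changes), assuming the k8s.io/api module is available:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// defaultRestartPolicy mirrors the guard in the diff above: it only fills in a
// default when the template left RestartPolicy empty, so an explicit value in
// the pod template is always respected.
func defaultRestartPolicy(template corev1.Pod, def corev1.RestartPolicy) corev1.Pod {
	pod := *template.DeepCopy()
	if pod.Spec.RestartPolicy == "" {
		pod.Spec.RestartPolicy = def
	}
	return pod
}

func main() {
	// No explicit policy in the template: the default (Always, after the patch) applies.
	empty := corev1.Pod{}
	fmt.Println(defaultRestartPolicy(empty, corev1.RestartPolicyAlways).Spec.RestartPolicy) // Always

	// An explicit OnFailure in the template is kept, patch or not.
	explicit := corev1.Pod{Spec: corev1.PodSpec{RestartPolicy: corev1.RestartPolicyOnFailure}}
	fmt.Println(defaultRestartPolicy(explicit, corev1.RestartPolicyAlways).Spec.RestartPolicy) // OnFailure
}
```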
I thought the controller was supposed to restart the pod, rather than the pod restarting itself? That way it can check for new tokens, etc.
I trust the reasons are something like this. I'm not seeing any adverse effects at the moment.
runner_controller already restarts the runner pod (by recreating the whole pod once it completes), but that turned out to introduce another chance of a race condition between ARC and GitHub.
And this seems impossible to me if everything worked correctly 🤔
This was an interesting part to me. I thought K8s would mark the pod NotReady only when it started correctly but then failed due to readiness probe failures. That shouldn't happen in regular scenarios 🤔
This definitely happens, and I can kind of understand it from a GitHub point of view. There was a runner, it finished a job (just now), and I have queued jobs. It definitely was registered and running. GitHub does not appear to work under the assumption that the runner will disappear after finishing a job. Just to be sure: we are running organizational runners, with right now one (largeish) repository using the self-hosted runners.
Well, there are no readiness probes configured here in my setup right now; I was looking into using them to check and restart the pod. The controller seems to respect that the runner is busy.
@genisd Hey. Thanks a lot for sharing! Recently I've seen ARC unnecessarily recreating ephemeral runner pods when it's about to scale down (from, say, 100 to 10 replicas), which results in a race condition between ARC and GitHub. That is, ARC recreates ephemeral runner pods that exited with code 0. ARC as of today does block any "busy" runner from being terminated, but there seems to be an edge case where the runner isn't "busy" while a job is assigned to it and is about to run. I've fixed various race conditions around that case in #1127 and #1167. Hopefully you'll see this issue less often (or never) after those improvements. If we set …
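To illustrate the edge case described in that comment, here is a small hypothetical Go sketch (the types and the busy check are illustrative, not ARC's real code): a scale-down guard that only looks at the "busy" flag reported by GitHub can still pick a runner that already has a job assigned but not yet started.

```go
package main

import "fmt"

// runnerView is a hypothetical snapshot of what the controller can observe for a runner.
type runnerView struct {
	Name string
	Busy bool // what the GitHub API currently reports
}

// pickScaleDownCandidates keeps only runners that look idle. The race described
// above: a job can already be assigned to a runner whose Busy flag is still
// false, so "not busy" is not the same as "safe to terminate".
func pickScaleDownCandidates(runners []runnerView) []runnerView {
	var idle []runnerView
	for _, r := range runners {
		if !r.Busy {
			idle = append(idle, r)
		}
	}
	return idle
}

func main() {
	runners := []runnerView{
		{Name: "runner-a", Busy: true},
		{Name: "runner-b", Busy: false}, // may already have a job assigned but not started
	}
	fmt.Println(pickScaleDownCandidates(runners)) // [{runner-b false}]
}
```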
I'm not running #1127 / #1167 yet, and with holidays coming up I don't want to try them before then. In the meantime I'm happy to close this issue; I wasn't even sure if it was the right place to begin with.
@genisd Thanks for confirming! I'll await your comeback 😄 |
FYI, I also faced this issue while using v0.22.0 on AWS EKS. When I switched to v0.21.0, I did not face this issue. In my case, the runner completes a job and exits with code 0, then the pod enters the NotReady state and GitHub removes it from the self-hosted runner list. There was only one job, and no other job was allotted to the runner after the first one was successfully executed. I followed the exact steps from the README.md.
Following is the …
@shivamag00 I'm seeing the same behavior with v0.22.0; reverting to v0.21.0 fixes mine as well. With v0.22.0 it usually resolves itself after ~5 minutes. @mumoshu should we open a new issue for this?
@nicholasgibson2 Yeah! Please. Full context (your configuration, logs from the runners and the ARC controller-manager, etc.) would be appreciated; otherwise I can't debug issues like this. Thanks in advance for your cooperation!
I'm also seeing this. I've sometimes seen it take over 10 minutes for the NotReady pods to exit. When it's in the NotReady state (which happens almost every time a job completes), the "runner" container has exited cleanly, but the docker container is still running.
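For reference, the state being described can be spotted from the pod status alone: the pod phase stays Running (the docker sidecar is still up) while the Ready condition is False, because the runner container exited cleanly and, under RestartPolicy: OnFailure, is never restarted. Below is a minimal sketch of such a check, assuming k8s.io/api is available (this helper is hypothetical, not part of ARC):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isStuckNotReady reports whether a pod is still Running but no longer Ready,
// which matches the symptom described above.
func isStuckNotReady(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionFalse
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{
		Phase: corev1.PodRunning,
		Conditions: []corev1.PodCondition{
			{Type: corev1.PodReady, Status: corev1.ConditionFalse},
		},
	}}
	fmt.Println(isStuckNotReady(pod)) // true
}
```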
@cspargo Thanks! Full context (your configuration, logs from the runners and the ARC controller-manager, etc.), in a dedicated issue, would be appreciated 🙏 because it's almost certainly a different issue than this one. (ARC's internals have changed considerably since then, and NotReady is just the surface of an issue with hundreds of possible causes.)
To solve this, instead of …
Describe the bug
The runner completes a job and exits with code 0, then the pod enters the NotReady state. In the meantime GH Actions allocated a job for this worker to pick up.
The pod will get cleaned up after some time.
The job never gets picked up, doesn't get scheduled elsewhere, and just enters a failed state.
Checks
- actions-runner-controller is yesterday's master (7156ce04)
- image: summerwind/actions-runner:latest as the runner

To Reproduce
This behaviour might be specific to GKE perhaps? We're currently running Kubernetes 1.21.6. Scaling setup is webhooks only.
The pod ends up in the NotReady state, with a job being queued from GH Actions.

Expected behavior
The pod should restart and pick up the next job if it has one allocated. And it does so if I intervene manually (delete the pod): the job gets picked up.
So I think the only needed change is RestartPolicy -> Always.
Screenshots
I'm sorry, the evidence disappeared from my screens; I'm currently running a fork. If it's really needed I can reproduce this quite easily.
So currently the RestartPolicy is OnFailure, which triggers this behaviour. Most of the time pods get terminated on completion, I think.
I think the right value should be Always (I'm currently running a fork from yesterday's master with this being the only change). Is there a reason for it not to be Always for everyone at all times?
Here Kubernetes doesn't appear to restart the pod because the exit code is 0. RestartPolicy: Always simply restarts it, and that fixes the problem.
I started a PR to make it configurable (there appear to be incomplete bits for this already), but I made a mistake somewhere.
So before I complete that work (to make it configurable), would it not be better to change the default?
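If the default does become configurable, the selection logic might look roughly like the sketch below. This is a hedged illustration only: the configured parameter is hypothetical and not ARC's actual API. An explicit pod-template value wins, then an operator-provided value, then the Always default proposed in this issue.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// effectiveRestartPolicy is a hypothetical helper showing one possible precedence:
// an explicit pod-template value wins, then a configured value, then Always.
func effectiveRestartPolicy(template corev1.Pod, configured corev1.RestartPolicy) corev1.RestartPolicy {
	if p := template.Spec.RestartPolicy; p != "" {
		return p
	}
	if configured != "" {
		return configured
	}
	return corev1.RestartPolicyAlways
}

func main() {
	// No template value and nothing configured: fall back to Always.
	fmt.Println(effectiveRestartPolicy(corev1.Pod{}, ""))

	// A configured OnFailure applies when the template does not set anything.
	fmt.Println(effectiveRestartPolicy(corev1.Pod{}, corev1.RestartPolicyOnFailure))
}
```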