
Pod stuck in NotReady state - RestartPolicy OnFailure? #1144

Closed
genisd opened this issue Feb 22, 2022 · 15 comments

@genisd
Contributor

genisd commented Feb 22, 2022

Describe the bug
The runner completes a job and exits with code 0, then the pod enters the NotReady state.
In the meantime GitHub Actions allocated a job for this runner to pick up.
The pod gets cleaned up after some time.
The job never gets picked up, doesn't get scheduled elsewhere, and just enters a failed state.

Checks

  • [x] My actions-runner-controller version (v0.x.y) does support the feature
  • [x] I'm using an unreleased version of the controller I built from HEAD of the default branch
    actions-runner-controller is yesterday's master (7156ce04)
    image: summerwind/actions-runner:latest as the runner

To Reproduce
This behaviour might be specific to GKE.
We're currently running Kubernetes 1.21.6.
Scaling is set up with webhooks only.

  1. Create a RunnerDeployment with ephemeral: true (a minimal sketch follows this list).
  2. Run jobs.
  3. At about the time the jobs from step 2 complete, schedule more jobs.
  4. Sometimes a pod will linger behind in the NotReady state, with a job being queued from GitHub Actions.
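
For reference, a minimal sketch of the RunnerDeployment meant in step 1 (the name and organization value are placeholders, not my actual setup):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy   # placeholder name
spec:
  template:
    spec:
      organization: your-org   # placeholder; we run organizational runners
      ephemeral: true          # runner exits after completing a single job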

Expected behavior
The pod should restart and pick up the next job if it has one allocated.
It does so if I intervene manually: delete the pod and the job gets picked up.
So I think the only needed change is RestartPolicy -> Always

Screenshots
I'm sorry, the evidence disappeared from my screens; I'm currently running a fork.
If it's really needed I can reproduce this quite easily.

So currently the RestartPolicy is OnFailure, which triggers this behaviour.
Most of the time pods get terminated on completion, I think.

I think the right value should be Always (I'm currently running a fork of yesterday's master with this being the only change).

Is there a reason for it not to be Always for everyone at all times?
Here Kubernetes doesn't appear to restart the pod because the exit code is 0. RestartPolicy: Always simply restarts it, and that fixes the problem.
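
To illustrate the Kubernetes semantics outside of ARC (a throwaway demo pod, not an actual runner): with OnFailure the kubelet only restarts containers that exit non-zero, while Always restarts them regardless of exit code.

apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-demo   # hypothetical demo pod, not an ARC runner
spec:
  restartPolicy: OnFailure    # exit code 0 counts as success, so the container is not restarted
  containers:
    - name: exit-zero
      image: busybox
      command: ["sh", "-c", "sleep 5 && exit 0"]

With restartPolicy: Always the same container would be restarted after it exits, which is the behaviour I'm relying on here.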

I started a PR to make this configurable (there appear to be incomplete bits for it already), but I made a mistake somewhere.
So before I complete that work, would it not be better to change the default?

@genisd
Contributor Author

genisd commented Feb 22, 2022

This patch works wonderfully here in my GKE environment.

index 1bfa8bc..41f7180 100644
--- a/controllers/runner_controller.go
+++ b/controllers/runner_controller.go
@@ -922,7 +922,7 @@ func newRunnerPod(template corev1.Pod, runnerSpec v1alpha1.RunnerConfig, default
        pod := template.DeepCopy()
 
        if pod.Spec.RestartPolicy == "" {
-               pod.Spec.RestartPolicy = "OnFailure"
+               pod.Spec.RestartPolicy = "Always"
        }
 
        if mtu := runnerSpec.DockerMTU; mtu != nil && dockerdInRunner {

@jbkc85

jbkc85 commented Feb 22, 2022

This patch works wonderfully here in my GKE environment.


I thought the controller was supposed to restart the pod, not the pod restarting itself? That way it can check for new tokens, etc.

@genisd
Contributor Author

genisd commented Feb 23, 2022

I trust the reasons are something like that. I'm not seeing any adverse effects at the moment.
I'm not sure if a general issue is the best place for this; perhaps I should discuss it in #911.

@mumoshu
Collaborator

mumoshu commented Feb 23, 2022

So I think the only needed change is RestartPolicy -> Always

runner_controller already restarts the runner pod (by recreating the whole pod once it completes), but that turned out to introduce another chance of a race condition between ARC and GitHub.

In the meantime GitHub Actions allocated a job for this runner to pick up.

And this seems impossible to me if everything worked correctly 🤔
My assumption was that GitHub would assign a queued job to a runner only when it's registered and running.
Perhaps that assumption doesn't hold in some edge cases?

The runner completes a job and exits with code 0, then the pod enters the NotReady state.

This was an interesting part to me. I thought K8s would mark the pod NotReady only when it started correctly but then failed due to readiness probe failures. That shouldn't happen in regular scenarios 🤔

@genisd
Contributor Author

genisd commented Mar 1, 2022

In the meantime GitHub Actions allocated a job for this runner to pick up.

And this seems impossible to me if everything worked correctly 🤔 My assumption was that GitHub would assign a queued job to a runner only when it's registered and running. Perhaps that assumption doesn't hold in some edge cases?

This definitely happens. And I can kind of understand it from GitHub's point of view: there was a runner, it finished a job just now, and there are queued jobs, so it definitely was registered and running. GitHub does not appear to work under the assumption that the runner will disappear after finishing a job.

Just to be sure: we are running organizational runners, with right now one (largish) repository using the self-hosted runners.

The runner completes a job and exits with code 0, then the pod enters the NotReady state.

This was an interesting part to me. I thought K8s would mark the pod NotReady only when it started correctly but then failed due to readiness probe failures. That shouldn't happen in regular scenarios 🤔

Well, there are no readiness probes configured in my setup right now; I was looking into using them to check & restart the pod.
Setting the RestartPolicy to Always was just a little easier to achieve (as a workaround). In fact it works perfectly for this specific issue/bug.

The controller seems to be respecting that the runner is busy.
I don't know the code, but could it be that the pod does not get restarted because it has a job queued to it? The controller might not want to touch it in that case (rightfully so).

@mumoshu
Collaborator

mumoshu commented Mar 2, 2022

@genisd Hey. Thanks a lot for sharing!

Although RestartPolicy=Always looks good in certain cases, it's not perfect.

Recently I've seen ARC unnecessarily recreating ephemeral runner pods when it's about to scale down (from, say, 100 to 10 replicas), which results in a race condition between ARC and GitHub. That is, ARC recreates ephemeral runner pods that exited with 0 in order to maintain the desired number of runners; then HRA scales down and ARC removes the now-unnecessary ephemeral runner pods. However, those ephemeral runner pods were about to start, and some of them get workflow jobs assigned by GitHub but are then terminated prematurely because ARC removed the runner pods.

ARC as of today does block any "busy" runner from being terminated. But there seems to be an edge case where the runner isn't "busy" even though a job is assigned to it and is about to run.

I've fixed various race conditions around this case in #1127 and #1167. Hopefully you'll see this issue less often (or never) after those improvements.

If we set RestartPolicy=Always, it might still result in a similar race condition, because ARC would then have no way to coordinate a pod restart/recreate with a scale-down. ARC has to make sure not to terminate a runner that is about to run a job, and that's impossible with RestartPolicy=Always, because it makes the timing of a restart uncontrollable from ARC's side.

@genisd
Contributor Author

genisd commented Mar 2, 2022

I'm not running #1127 / #1167 yet, and with holidays coming up I don't want to try them before I leave.
I'll be back on the 21st of March and will then give the latest and greatest master (or perhaps a release) a go.

In the meantime I'm happy to close this issue; I wasn't even sure it was the right place to begin with.
It relates specifically to a master commit/state and might not really be relevant for any(?) tagged release.

@genisd genisd closed this as completed Mar 2, 2022
@mumoshu
Collaborator

mumoshu commented Mar 3, 2022

@genisd Thanks for confirming! I'll await your comeback 😄

@shivamag00

shivamag00 commented Mar 28, 2022

FYI, I also faced this issue while using v0.22.0 on AWS EKS. When I switched to v0.21.0, I did not face this issue.

In my case, the runner completes a job and exits with code 0, then the pod enters the NotReady state and GitHub removes it from the self-hosted runner list. There was only one job; no other job was allotted to the runner after the first one had been executed successfully.

I followed the exact steps from the README.md:

  • Installed cert-manager using kubectl
  • kubectl deployment for the CRDs and actions-runner-controller
  • Used a GitHub PAT with a kubectl secret for authentication (a sketch of that secret follows this list)
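
For reference, the PAT secret looked roughly like this (a sketch; the name, namespace and key are the defaults from the README, and the token value is redacted):

apiVersion: v1
kind: Secret
metadata:
  name: controller-manager          # default secret name from the README
  namespace: actions-runner-system  # default namespace from the README
stringData:
  github_token: <REDACTED_PAT>      # GitHub personal access token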

Following is the runner.yaml that I used

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  name: aws-eks-runner
spec:
  repository: some-github-repo
  env: []

@nicholasgibson2

@shivamag00 I'm seeing the same behavior with v0.22.0, reverting to v0.21.0 fixes mine as well. With v0.22.0 it usually resolves itself after ~5 minutes. @mumoshu should we open a new issue for this?

@mumoshu
Collaborator

mumoshu commented Mar 28, 2022

@nicholasgibson2 Yeah! Please. Full context (your configuration, logs from the runners and the ARC controller-manager, etc.) would be appreciated; otherwise I can't debug issues like this. Thanks in advance for your cooperation!

@cspargo
Contributor

cspargo commented Mar 29, 2022

I'm also seeing this. I've sometimes seen it take over 10 minutes for the NotReady pods to exit.

When a pod is in the NotReady state (which happens almost every time a job completes), the "runner" container has exited cleanly, but the docker container is still running.

@mumoshu
Collaborator

mumoshu commented Mar 29, 2022

@cspargo Thanks! Full context (your configuration, logs from the runners and the ARC controller-manager, etc.), in a dedicated issue, would be appreciated 🙏 because it's almost certainly a different issue than this one. (ARC's internals have changed considerably since then, and NotReady is just the surface of an issue with hundreds of possible causes.)

@nicholasgibson2

@mumoshu Sorry for the delay, here's the new issue: #1291. Let me know if you need any more info or want me to test anything on my end.

P.S. My coworkers and I are loving this project!

@maheshkhade

To solve this, instead of kind: Runner I used kind: RunnerDeployment.
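
For example, the Runner manifest above can be wrapped into a RunnerDeployment roughly like this (a sketch, not the exact manifest I used):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: aws-eks-runner
spec:
  replicas: 1
  template:
    spec:
      repository: some-github-repo   # same placeholder as in the Runner manifest above
      env: []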
