
Pod stuck in NotReady state - RestartPolicy OnFailure? #1144

Closed
genisd opened this issue Feb 22, 2022 · 15 comments

@genisd
Contributor

genisd commented Feb 22, 2022

Describe the bug
The runner completes a job and exits with code 0, then the pod enters the NotReady state.
In the meantime GitHub Actions allocated a job for this runner to pick up.
The pod gets cleaned up after some time.
The job never gets picked up, doesn't get scheduled elsewhere, and just enters a failed state.

Checks

  • [x] My actions-runner-controller version (v0.x.y) does support the feature
  • [x] I'm using an unreleased version of the controller I built from HEAD of the default branch
    actions-runner-controller is yesterday's master (7156ce04)
    image: summerwind/actions-runner:latest as the runner

To Reproduce
This behaviour might be specific to GKE.
We're currently running Kubernetes 1.21.6.
Scaling is set up with webhooks only.

  1. Create a RunnerDeployment with ephemeral: true (a minimal sketch follows this list).
  2. Run jobs.
  3. At about the time the jobs from step 2 complete, schedule more jobs.
  4. Sometimes a pod will linger behind in the NotReady state, with a job being queued from GitHub Actions.
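
For reference, a minimal sketch of the RunnerDeployment meant in step 1 (the name and organization value are placeholders, not my actual setup):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy   # placeholder name
spec:
  template:
    spec:
      organization: your-org   # placeholder; we run organizational runners
      ephemeral: true          # runner exits after completing a single job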

Expected behavior
The pod should restart and pick up the next job if it has one allocated.
It does so if I intervene manually: delete the pod and the job gets picked up.
So I think the only needed change is RestartPolicy -> Always

Screenshots
I'm sorry, the evidence disappeared from my screens; I'm currently running a fork.
If it's really needed I can reproduce this quite easily.

So currently the RestartPolicy is OnFailure, which triggers this behaviour.
Most of the time pods get terminated on completion, I think.

I think the right value should be Always (I'm currently running a fork of yesterday's master with this being the only change).

Is there a reason for it not to be Always for everyone at all times?
Here Kubernetes doesn't appear to restart the pod because the exit code is 0. RestartPolicy: Always simply restarts it, and that fixes the problem.
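
To illustrate the Kubernetes semantics outside of ARC (a throwaway demo pod, not an actual runner): with OnFailure the kubelet only restarts containers that exit non-zero, while Always restarts them regardless of exit code.

apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-demo   # hypothetical demo pod, not an ARC runner
spec:
  restartPolicy: OnFailure    # exit code 0 counts as success, so the container is not restarted
  containers:
    - name: exit-zero
      image: busybox
      command: ["sh", "-c", "sleep 5 && exit 0"]

With restartPolicy: Always the same container would be restarted after it exits, which is the behaviour I'm relying on here.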

I started a PR to make this configurable (there appear to be incomplete bits for it already), but I made a mistake somewhere.
So before I complete that work, would it not be better to change the default?

@genisd
Contributor Author

genisd commented Feb 22, 2022

This patch works wonderfully here in my GKE environment.

index 1bfa8bc..41f7180 100644
--- a/controllers/runner_controller.go
+++ b/controllers/runner_controller.go
@@ -922,7 +922,7 @@ func newRunnerPod(template corev1.Pod, runnerSpec v1alpha1.RunnerConfig, default
        pod := template.DeepCopy()
 
        if pod.Spec.RestartPolicy == "" {
-               pod.Spec.RestartPolicy = "OnFailure"
+               pod.Spec.RestartPolicy = "Always"
        }
 
        if mtu := runnerSpec.DockerMTU; mtu != nil && dockerdInRunner {

@jbkc85

jbkc85 commented Feb 22, 2022

This patch works wonderfully here in my GKE environment.


I thought the controller was supposed to restart the pod, not the pod restarting itself? That way it can check for new tokens, etc.

@genisd
Contributor Author

genisd commented Feb 23, 2022

I trust the reasons are something like that. I'm not seeing any adverse effects at the moment.
I'm not sure if a general issue is the best place for this; perhaps I should discuss it in #911.

@mumoshu
Collaborator

mumoshu commented Feb 23, 2022

So I think the only needed change is RestartPolicy -> Always

runner_controller already restarts the runner pod (by recreating the whole pod once it completes), but that turned out to introduce another chance of a race condition between ARC and GitHub.

In the meantime GitHub Actions allocated a job for this runner to pick up.

And this seems impossible to me if everything worked correctly 🤔
My assumption was that GitHub would assign a queued job to a runner only when it's registered and running.
Perhaps that assumption doesn't hold in some edge cases?

The runner completes a job and exits with code 0, then the pod enters the NotReady state.

This was an interesting part to me. I thought K8s would mark the pod NotReady only when it started correctly but then failed due to readiness probe failures. That shouldn't happen in regular scenarios 🤔

@genisd
Contributor Author

genisd commented Mar 1, 2022

In the meantime GitHub Actions allocated a job for this runner to pick up.

And this seems impossible to me if everything worked correctly 🤔 My assumption was that GitHub would assign a queued job to a runner only when it's registered and running. Perhaps that assumption doesn't hold in some edge cases?

This definitely happens. And I can kind of understand it from GitHub's point of view: there was a runner, it finished a job just now, and there are queued jobs, so it definitely was registered and running. GitHub does not appear to work under the assumption that the runner will disappear after finishing a job.

Just to be sure: we are running organizational runners, with right now one (largish) repository using the self-hosted runners.

The runner completes a job and exits with code 0, then the pod enters the NotReady state.

This was an interesting part to me. I thought K8s would mark the pod NotReady only when it started correctly but then failed due to readiness probe failures. That shouldn't happen in regular scenarios 🤔

Well, there are no readiness probes configured in my setup right now; I was looking into using them to check & restart the pod.
Setting the RestartPolicy to Always was just a little easier to achieve (as a workaround). In fact it works perfectly for this specific issue/bug.

The controller seems to be respecting that the runner is busy.
I don't know the code, but could it be that the pod does not get restarted because it has a job queued to it? The controller might not want to touch it in that case (rightfully so).

@mumoshu
Collaborator

mumoshu commented Mar 2, 2022

@genisd Hey. Thanks a lot for sharing!

Although RestartPolicy=Always looks good in certain cases, it's not perfect.

Recently I've seen ARC unnecessarily recreating ephemeral runner pods when it's about to scale down (from, say, 100 to 10 replicas), which results in a race condition between ARC and GitHub. That is, ARC recreates ephemeral runner pods that exited with 0 in order to maintain the desired number of runners; then HRA scales down and ARC removes the now-unnecessary ephemeral runner pods. However, those ephemeral runner pods were about to start, and some of them get workflow jobs assigned by GitHub but are then terminated prematurely because ARC removed the runner pods.

ARC as of today does block any "busy" runner from being terminated. But there seems to be an edge case where the runner isn't "busy" even though a job is assigned to it and is about to run.

I've fixed various race conditions around this case in #1127 and #1167. Hopefully you'll see this issue less often (or never) after those improvements.

If we set RestartPolicy=Always, it might still result in a similar race condition, because ARC would then have no way to coordinate a pod restart/recreate with a scale-down. ARC has to make sure not to terminate a runner that is about to run a job, and that's impossible with RestartPolicy=Always, because it makes the timing of a restart uncontrollable from ARC's side.

@genisd
Contributor Author

genisd commented Mar 2, 2022

I'm not running #1127 / #1167 yet, and with holidays coming up I don't want to try them before I leave.
I'll be back on the 21st of March and will then give the latest and greatest master (or perhaps a release) a go.

In the meantime I'm happy to close this issue; I wasn't even sure it was the right place to begin with.
It relates specifically to a master commit/state and might not really be relevant for any(?) tagged release.

@genisd genisd closed this as completed Mar 2, 2022
@mumoshu
Collaborator

mumoshu commented Mar 3, 2022

@genisd Thanks for confirming! I'll await your comeback 😄

@shivamag00

shivamag00 commented Mar 28, 2022

FYI, I also faced this issue while using v0.22.0 on AWS EKS. When I switched to v0.21.0, I did not face this issue.

In my case, the runner completes a job and exits with code 0, then the pod enters the NotReady state and GitHub removes it from the self-hosted runner list. There was only one job; no other job was allotted to the runner after the first one had been executed successfully.

I followed the exact steps from the README.md:

  • Installed cert-manager using kubectl
  • kubectl deployment for the CRDs and actions-runner-controller
  • Used a GitHub PAT with a kubectl secret for authentication (a sketch of that secret follows this list)
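
For reference, the PAT secret looked roughly like this (a sketch; the name, namespace and key are the defaults from the README, and the token value is redacted):

apiVersion: v1
kind: Secret
metadata:
  name: controller-manager          # default secret name from the README
  namespace: actions-runner-system  # default namespace from the README
stringData:
  github_token: <REDACTED_PAT>      # GitHub personal access token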

Following is the runner.yaml that I used

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  name: aws-eks-runner
spec:
  repository: some-github-repo
  env: []

@nicholasgibson2

@shivamag00 I'm seeing the same behavior with v0.22.0, reverting to v0.21.0 fixes mine as well. With v0.22.0 it usually resolves itself after ~5 minutes. @mumoshu should we open a new issue for this?

@mumoshu
Collaborator

mumoshu commented Mar 28, 2022

@nicholasgibson2 Yeah! Please. Full context (your configuration, logs from the runners and the ARC controller-manager, etc.) would be appreciated; otherwise I can't debug issues like this. Thanks in advance for your cooperation!

@cspargo
Contributor

cspargo commented Mar 29, 2022

I'm also seeing this. I've sometimes seen it take over 10 minutes for the NotReady pods to exit.

When a pod is in the NotReady state (which happens almost every time a job completes), the "runner" container has exited cleanly, but the docker container is still running.

@mumoshu
Collaborator

mumoshu commented Mar 29, 2022

@cspargo Thanks! Full context (your configuration, logs from the runners and the ARC controller-manager, etc.), in a dedicated issue, would be appreciated 🙏 because it's almost certainly a different issue than this one. (ARC's internals have changed considerably since then, and NotReady is just the surface of an issue with hundreds of possible causes.)

@nicholasgibson2

@mumoshu Sorry for the delay, here's the new issue: #1291. Let me know if you need any more info or want me to test anything on my end.

P.S. My coworkers and I are loving this project!

@maheshkhade

To solve this, instead of kind: Runner I used kind: RunnerDeployment.
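
For example, the Runner manifest above can be wrapped into a RunnerDeployment roughly like this (a sketch, not the exact manifest I used):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: aws-eks-runner
spec:
  replicas: 1
  template:
    spec:
      repository: some-github-repo   # same placeholder as in the Runner manifest above
      env: []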
