Controller keeps reconciling with non-existent runners - leads to infinite runners #512
Comments
@gstravinskaite Hey! You've probably missed updating CRDs. Please see #427, #467, and #468
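A minimal sketch of what re-applying the CRDs during a chart upgrade can look like; the CRD directory path, release name, chart reference, and namespace below are assumptions for illustration, not taken from this thread:

```bash
# Helm 3 only installs CRDs on the first install; `helm upgrade` does not
# update them, so they usually have to be re-applied by hand. The path below
# is a placeholder for wherever the chart's crds/ directory lives locally.
kubectl apply -f ./charts/actions-runner-controller/crds/

# Then upgrade the controller release itself (release name, chart reference,
# and namespace here are illustrative).
helm upgrade actions-runner-controller actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system
```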
TL;DR: There's no easy way to roll back if you broke your deployment by only upgrading the controller, and then broke it even more by downgrading the controller while leaving runners created by the newer controller. If that's the case, all you could do would be to stop the controller entirely, manually force-delete the runners on K8s, and then go to the Actions page of your repository or organization and delete the registered runners manually.
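A rough sketch of that recovery path, assuming the controller runs in the actions-runner-system namespace; the deployment name and the runner namespace are illustrative:

```bash
# 1. Stop the controller so it cannot create or touch anything while cleaning up.
kubectl -n actions-runner-system scale deployment actions-runner-controller --replicas=0

# 2. Force-delete the runner custom resources (repeat per namespace that has runners);
#    clearing the finalizer lets them go away even though the controller is stopped.
NS=default   # illustrative namespace
for r in $(kubectl -n "$NS" get runners -o name); do
  kubectl -n "$NS" patch "$r" --type merge -p '{"metadata":{"finalizers":null}}'
  kubectl -n "$NS" delete "$r" --wait=false
done

# 3. Finally, remove the leftover self-hosted runners from the GitHub UI
#    (Settings -> Actions -> Runners) for the repository or organization.
```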
I don't get this. Technically this is very unlikely to happen, as the controller consults the latest controller-runtime cache to decide what to reconcile. If you really removed all the runners (how did you do that? I thought you had to
I deleted the CRDs and recreated them with Helm. Is this what you mean by "upgrade"?
The issue is that I did delete all the runners manually, both in the cluster AND under the Actions tab, but the issue persists.
How did I remove the runners? Well, I removed the namespace and then, yes, deleted the finalizer and patched it. But I now see that the namespace deletion did not delete the runners. My bad. I will try to clear them. Thanks!
Sounds good 👍
Alright! To be extra clear, what you needed to run
We're having the same issue. The finalizer keeps us from deleting the runners. Patching the runners generates the following error:
Rolling back to 0.10.5 of the Helm chart also generates this error.
I think I saw the same error; we patched the runners while the controller was still running. For us, we noticed the controller was actually deleting the runners by itself. I suppose we had a couple hundred thousand of them before. So I think if you do a complete wipe, delete the CRDs, and then deploy the old/new version of the controller again, it should keep deleting runners (see the sketch below). From our side, we just let the controller run for a couple of hours and it deleted all the runners; things are more or less back to normal. I'm thinking that upgrade documentation could be a useful thing here.
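A sketch of the complete-wipe approach described above; the release name, namespace, and CRD names are assumptions about a typical summerwind actions-runner-controller install, not details from this thread:

```bash
# Remove the controller release entirely (names are illustrative).
helm -n actions-runner-system uninstall actions-runner-controller

# Deleting a CRD also deletes all of its custom resources; if stuck finalizers
# remain this can hang, in which case they have to be patched away first.
kubectl delete crd \
  runners.actions.summerwind.dev \
  runnerreplicasets.actions.summerwind.dev \
  runnerdeployments.actions.summerwind.dev \
  horizontalrunnerautoscalers.actions.summerwind.dev

# Reinstall the version you want and let it reconcile from a clean slate.
helm -n actions-runner-system install actions-runner-controller \
  actions-runner-controller/actions-runner-controller
```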
#519 I'm going to work on moving away from the does-it-all action so we can better provide upgrade docs etc. as part of the release.
@grggls You seem to have broken your admission webhook service somehow. Reading your logs, perhaps you've completely removed the K8s service named
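A quick way to check whether the admission webhook pieces are still in place; the namespace is an assumption, and the exact service name was not captured above:

```bash
# Check that the admission webhook's backing service and the webhook
# configurations still exist; if the service is missing, API requests that
# go through the webhook will fail.
kubectl -n actions-runner-system get svc
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations | grep -i runner
```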
Closing as the original issue seems to have been resolved 👍 Thanks for reporting and for all your support, everyone!
Hi,
We are using Helm charts to provision the controller. Up until now we were on version 16.1 with chart version 2, and we have now tried to upgrade to 18.2 with chart 10.4. We immediately ran into the infinite pod issue described in #427. I then reverted the version, but the infinite pod scheduling still persisted. I should mention that I upgraded the controller, left it until the next morning, and only then created the runners and saw the infinite pod scheduling issue. I then tried to wipe the state clean: destroyed the Helm release, manually deleted the CRDs, and upgraded to 18.2 again, and the issue is still there. We started suspecting that the controller issued so many requests that GitHub is still throwing them back at us. Even though the API request limit was exceeded (over 5000), I only saw around 200 runners scheduled (people in other issues mentioned as many as 2000 runners). We deleted the runners we saw, but the controller keeps reconciling against non-existent runners. What is more interesting is that the runner pod IDs do not seem to have changed since the time a lot of them spun up, leading to errors such as:
runner-infra-xm9vc
is the ID of the old pod, which we deleted; we also cleared the existing runners from GitHub. Today, we tried to set up the runner again and now no pods/runners are being scheduled, but the controller is very busy and keeps spewing logs such as:
Any help would be appreciated. This is the controller setup:
and this is for the runner: