runner controller failing in 0.18.2 #468
Did you update your CRDs as part of your upgrade to 0.18.2? How are you deploying the solution?
I deleted everything with helm delete and then deleted all the CRDs as well. Then I installed everything again with 0.18.2, but the problem still existed. After that I downgraded the chart back to 0.17.0.
oh god, I did not execute
and after
let's try again now
hmm, looks like this helm command does not update CRDs: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/
great
I can confirm that this problem is also happening with the new CRDs.
now it looks like the controller went crazy (I tried 0.18.2, 0.18.1 and then went back to 0.17.0). I have 2 replicas in both RunnerDeployments:
one minute later: `kubectl get pods -n gha | wc -l` … after 5 minutes: … it's some kind of loop adding new runners :) quite a critical bug I would say
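For reference, the pod count from the command above can be sampled on an interval to watch the loop in action; a small sketch (namespace taken from the report, interval arbitrary):

```shell
# Print the runner pod count every 30 seconds; stop with Ctrl-C.
# --no-headers keeps the header row out of the count.
watch -n 30 'kubectl get pods -n gha --no-headers | wc -l'
```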
Yeah, Helm only installs CRDs; it will not delete or update them. When doing an upgrade that involves CRD changes, you need to uninstall everything and delete the CRDs first.
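Since Helm 3 only applies the chart's `crds/` directory on first install, the CRDs have to be updated out of band before the chart upgrade. A minimal sketch, assuming a checkout of the repo at the target tag (repo URL, chart path, namespace, and release name are assumptions; adjust to your setup):

```shell
# Fetch the chart sources at the target version (tag name assumed).
git clone --depth 1 --branch v0.18.2 \
  https://github.com/summerwind/actions-runner-controller.git

# Replace the CRDs first. `kubectl replace` avoids apply's annotation-size
# limit on large CRDs, but requires the CRDs to already exist in the cluster.
kubectl replace -f actions-runner-controller/charts/actions-runner-controller/crds/

# Only then upgrade the chart itself.
helm upgrade actions-runner-controller \
  ./actions-runner-controller/charts/actions-runner-controller \
  -n actions-runner-system
```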
Any interesting controller logs? Can you post your RunnerDeployments? @mumoshu this has happened before during a similar scenario, #427 here it is. Is it worth adding some sort of
our runnerdeployments were
well, there were like 100k lines of log… impossible to investigate
Are you able to grep for things like …? Do you have any other runners deployed? Do you have any
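A sketch of how one might narrow down 100k log lines: dump the controller log to a file once, then filter locally. Namespace, deployment, and container names here are assumptions, and the sample log lines are placeholders rather than real controller output.

```shell
# Dump the controller log once (uncomment against a real cluster;
# -c selects the controller container if the pod runs several):
# kubectl logs -n actions-runner-system deploy/controller-manager -c manager > controller.log

# Stand-in log for illustration only -- placeholder lines, not real output:
cat > controller.log <<'EOF'
INFO  reconciling runnerreplicaset
ERROR failed to create runner pod
INFO  reconciling runnerreplicaset
ERROR failed to create runner pod
EOF

# Rough error count, then the most frequent error lines:
grep -c 'ERROR' controller.log
grep 'ERROR' controller.log | sort | uniq -c | sort -rn | head
```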
#467 is another relevant issue. There is a bit of a process to updating CRDs with Helm which, when done wrong (although it's not obvious whether the problem stems from Helm's lack of CRD management support or from an actions-runner-controller issue), causes a massive spike in runners. It's well worth producing a runbook internally for doing Helm upgrades of this project, as the CRDs annoyingly require a bit of extra care. I personally follow:
If possible, some log lines with
Sorry to see you getting caught by this. The short answer is that you seem to need to delete the runnerreplicasets that were created by v0.18. The RunnerDeployment was created by you, so just leave it there if it keeps working after you manually remove the runnerreplicasets created by v0.18. The v0.17 controller won't handle v0.18 runnerreplicasets nicely. actions-runner-controller is not tested for forward compatibility, so expecting the v0.17 controller to gracefully handle runners and runner deployments created by the v0.18 controller isn't really justified. Also, I'm not sure there's any general way to make a K8s controller gracefully handle "broken" CRDs; if we were to cover that in our implementation and tests, it would be a large effort that takes my time. That said, a sustainable way to further prevent this kind of issue would be to add upgrade instructions (noting that helm upgrade doesn't upgrade CRDs and that you need separate kubectl runs to upgrade the CRDs first) to the README or the GitHub release description.
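The cleanup described above can be sketched as follows (namespace and the fully qualified CRD resource name are assumptions; confirm them with `kubectl api-resources | grep runner` first):

```shell
# Inspect the runnerreplicasets left behind by the v0.18 controller.
kubectl get runnerreplicasets.actions.summerwind.dev -n gha

# Delete the stale ones. The RunnerDeployment you created stays, and the
# v0.17 controller then creates a fresh runnerreplicaset it understands.
kubectl delete runnerreplicasets.actions.summerwind.dev -n gha --all
```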
It's enough if there are just some kind of release notes saying "read this before updating".
@zetaab I've just done my best writing it :) Closing as answered and resolved, but please feel free to submit another issue for anything else. I can think of potential new issues about adding docs, enhancing the release notes, ideas to improve the controller so it doesn't break in terrible ways with invalid CRDs, etc.
If you have suggestions about how the release notes should be written, check #481 and contribute changes to the new
We updated the runner controller from 0.17.0 to 0.18.2 and the whole thing no longer works. Runners are not coming online, and in the logs we can see
I assume this behaviour was maybe introduced in #398
cc @mumoshu
Anyway, we downgraded back to 0.17.0 and everything seems fine now.