Che-operator fails to manage che failing after database outage #20337
Comments
@guydog28
It might not hurt to also put a postgres init container on the che and keycloak deployments that blocks them from starting until postgres is up. This is also good for non-operator-managed postgres, since the operator can't control startup order there.
Additionally, since an external postgres could go down at any time, there should be a livenessProbe on the Che deployment, maybe one that looks for a 200 response to the URL in the error screenshot (OPTIONS to /api/). If that fails, the container would restart and wait for the initContainer above.
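A minimal sketch of what such an init container could look like, assuming the PostgreSQL Service is named `postgres` and listens on port 5432 in the same namespace (these names are assumptions, not taken from the actual Che manifests):

```yaml
# Hypothetical init container that blocks the pod until PostgreSQL accepts connections.
# The Service name "postgres" and port 5432 are assumptions, not the real Che defaults.
initContainers:
  - name: wait-for-postgres
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        until nc -z postgres 5432; do
          echo "waiting for postgres..."
          sleep 2
        done
```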
Fixed by starting with init containers. However, the feature is disabled by default. See the PR for how to enable the init containers.
Doc in progress...
@mmorhun what is the proper way to set this in the Che operator YAML?
@guydog28 just run `kubectl edit deployment che-operator -n eclipse-che` and add the following under the operator container's environment variables:

```yaml
- name: ADD_COMPONENT_READINESS_INIT_CONTAINERS
  value: "true"
```

After that, the Che Operator pod should restart and add init containers to the Keycloak and Che Server deployments.
@mmorhun will this environment variable survive a chectl update? I was thinking there would be something on the CheCluster CRD that would create this env var on the operator deployment.
Also, @mmorhun, this solves one of the issues (waiting for postgres/keycloak on start with an init container), but it does not add liveness probes to Che for the case where one of those goes down later, after a successful start. With a livenessProbe on the che deployment, it would see that che is in an error state from postgres going down and terminate the pod. Then, when the deployment creates a new pod, the initContainer you created would take over and block the new pod from starting until postgres comes back online.
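A rough sketch of what such a livenessProbe might look like on the Che Server container, assuming the /api/ endpoint mentioned above is served on port 8080 (both the port and path are assumptions, not the operator's actual defaults):

```yaml
# Hypothetical liveness probe against the Che Server API endpoint.
# Port 8080 and the /api/ path are assumptions; adjust to the real container spec.
livenessProbe:
  httpGet:
    path: /api/
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 5
```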
Describe the bug
Our kubernetes cluster's primary goal is to serve Che as the development environment for our team. The cluster is managed by kOps. As a cost-saving measure, our leadership has requested that the cluster be shut down completely overnight and on weekends.
When the cluster comes back online in the morning, cluster state is restored from etcd backups by etcd-manager. This means the pods that were running when the cluster shut down are brought back up, so this time the operator isn't bringing them online and can't guarantee their startup order.
This results in a race condition where Che comes up before postgres (and sometimes keycloak as well), and Che is non-functional every morning. To get around this we have a CronJob that kills the Che pod 15 minutes after the cluster comes online so that it will come up after postgres (a rough sketch of that kind of workaround is below).
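For illustration, a sketch of that kind of workaround CronJob, where the schedule, namespace, label selector, and ServiceAccount are assumptions rather than our actual manifest:

```yaml
# Hypothetical workaround: delete the Che Server pod on a schedule so the
# Deployment recreates it after PostgreSQL is back. All names here are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-che
  namespace: eclipse-che
spec:
  schedule: "15 6 * * 1-5"   # roughly 15 minutes after the cluster comes online on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: che-pod-restarter   # needs RBAC permission to delete pods
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - sh
                - -c
                - kubectl delete pod -n eclipse-che -l app=che,component=che
```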
This kill-and-restart process has led to very hard feelings on the team toward Che (not great for perceived reliability). I'm sure this use case isn't a typical one, but it is more common than you'd think.
I would request that the operator have more robust health checking for postgres, keycloak, and Che. When one is failing (for example, Che gave up connecting to postgres), restart them in the proper order to get things back to a functional state. The purpose of operators is to automate this sort of mundane but necessary task.
Che version
7.34@latest
Steps to reproduce
Expected behavior
The operator would better monitor postgres, keycloak, and Che to detect issues, and restart them accordingly and in the proper order.
Runtime
Kubernetes (vanilla)
Screenshots
Installation method
chectl/latest
Environment
Linux
Eclipse Che Logs
No response
Additional context
No response