Updating a minio tenant image creates downtime #2364
Comments
This makes sense, thanks. I do wonder, though, why the operator is not configuring the probe automatically. I would expect an operator to manage a well-defined deployment by default, and it would be a much better user experience if the readiness probe were not something a user has to worry about, since it is an internal detail of MinIO.
The operator does not configure your tenant's readiness probe. This is an opt-in configuration in the tenant YAML, depending on when you consider your tenant to be actually ready, e.g. based on expected network timeouts or some other more complex logic.
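For reference, a minimal sketch of what such an opt-in probe might look like, assuming the Tenant CRD exposes a standard Kubernetes probe under spec.readiness and that the tenant serves MinIO's /minio/health/ready endpoint on port 9000; the tenant name, namespace, release tag, and timing values are placeholders, not recommendations.

```yaml
# Hypothetical sketch (not an operator default): a readiness probe declared on the
# tenant, assuming spec.readiness accepts a standard Kubernetes (corev1) probe.
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: my-tenant          # placeholder
  namespace: minio-tenant  # placeholder
spec:
  image: minio/minio:<release-tag>  # placeholder
  readiness:
    httpGet:
      path: /minio/health/ready  # MinIO readiness health endpoint
      port: 9000                 # default MinIO API port
      scheme: HTTPS              # assumes TLS is enabled on the tenant; use HTTP otherwise
    initialDelaySeconds: 10
    periodSeconds: 5
    timeoutSeconds: 5
    failureThreshold: 3
```

As noted above, the thresholds are only illustrative; the right values depend on when you consider the tenant actually ready.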
Please feel free to reopen if the issue persists after trying the suggested solution.
Thanks for your help @allanrogerr |
Expected Behavior
I would expect updating the MinIO image for a tenant to be seamless and not lead to any downtime.
Current Behavior
The current behavior when updating the `.spec.image` field of a tenant is that the operator updates the StatefulSet image backing the tenant pool. This causes the pods of the StatefulSet to roll one after another. While pods are rolling, pretty much all requests to MinIO start to fail. Even the console becomes very slow and sometimes times out.
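For context, the field in question is just the image reference on the tenant spec; a minimal excerpt (tenant name, namespace, and tag are placeholders) might look like this:

```yaml
# Minimal excerpt of a Tenant manifest; only spec.image is relevant here.
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: my-tenant          # placeholder
  namespace: minio-tenant  # placeholder
spec:
  image: minio/minio:<new-release-tag>  # changing this value rolls the backing StatefulSet
```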
Possible Solution
Steps to Reproduce (for bugs)
1. Create a tenant with a single pool, 16 nodes, and 1 drive per node (a sketch follows these steps).
2. Update the image for the tenant and wait for the StatefulSet pods to start rolling.
3. Requests against MinIO will fail for as long as the pods are rolling.
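As a rough illustration of step 1, a tenant with a single 16-server pool and one drive per server might be declared as below. This is a sketch only; the names, release tag, and storage size are placeholders.

```yaml
# Sketch of the pool topology from step 1: one pool, 16 servers, 1 drive each.
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: repro-tenant       # placeholder
  namespace: minio-tenant  # placeholder
spec:
  image: minio/minio:<initial-release-tag>  # placeholder
  pools:
    - name: pool-0
      servers: 16
      volumesPerServer: 1
      volumeClaimTemplate:
        metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi  # placeholder size
```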
Context
We see the following logs coming from all pods
I suspect this happens because the health check takes up to 30s to mark a peer as unhealthy: https://github.com/minio/minio/blob/master/internal/rest/client.go#L471-L503. During that time, I don't think the resiliency mechanism kicks in, and requests are still routed to the peer.
This is made worse by the fact that, as the pods roll, new peers constantly become unavailable but are not marked as unhealthy in time, which extends the period during which MinIO as a whole is unavailable.
Your Environment
Version used (minio-operator): minio/operator:v6.0.4