
Updating a minio tenant image creates downtime #2364

Closed

fpetkovski opened this issue Dec 6, 2024 · 5 comments

fpetkovski commented Dec 6, 2024

Expected Behavior

I would expect updating the minio image for a tenant to be seamless and not to lead to any downtime.

Current Behavior

When the .spec.image field of a tenant is updated, the operator updates the image of the statefulset backing the tenant pool. This causes the statefulset pods to roll one after another, and while they are rolling, almost all requests to MinIO fail.

Even the console becomes very slow and sometimes times out.

Possible Solution

Steps to Reproduce (for bugs)

1. Create a tenant with a single pool, 16 nodes and 1 drive per node.
2. Update the image for the tenant (see the sketch below) and wait for the statefulset pods to start rolling.

Requests against MinIO will fail as long as pods are rolling.
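
For reference, a minimal Tenant manifest matching this layout. This is a sketch, not the actual config from the report: the tenant name, namespace, and pool name are inferred from the log hostnames below, while the image tag and storage size are placeholders.

  apiVersion: minio.min.io/v2
  kind: Tenant
  metadata:
    name: tenant
    namespace: minio
  spec:
    image: quay.io/minio/minio:<new-release-tag>  # bumping this tag triggers the rolling update
    pools:
      - name: pool-0
        servers: 16            # 16 nodes
        volumesPerServer: 1    # 1 drive per node
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Ti   # placeholder size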

Context

We see the following logs from all pods:

API: SYSTEM.grid
Time: 13:01:29 UTC 12/06/2024
DeploymentID: 4fd70aef-1659-4ec6-b6ac-546812282941
Error: grid: http://tenant-pool-0-15.tenant-hl.minio.svc.cluster.local:9000 re-connecting to http://tenant-pool-0-0.tenant-hl.minio.svc.cluster.local:9000: dial tcp 10.136.14.157:9000: i/o timeout (*net.OpError) Sleeping 0s (3) (*fmt.wrapError)
       6: internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()
       5: internal/logger/logonce.go:149:logger.LogOnceIf()
       4: internal/grid/connection.go:59:grid.gridLogOnceIf()
       3: internal/grid/connection.go:672:grid.(*Connection).connect.func1()
       2: internal/grid/connection.go:678:grid.(*Connection).connect()
       1: internal/grid/connection.go:275:grid.newConnection.func3()

I suspect this happens because the health check takes up to 30s to mark a peer as unhealthy: https://github.com/minio/minio/blob/master/internal/rest/client.go#L471-L503. During that time, I don't think the resiliency mechanism kicks in, and requests are still routed to the peer.

This is made worse by the fact that as pods roll, new peers constantly become unavailable but are not marked as unhealthy in time, which extends the period for which MinIO as a whole is unavailable.

Your Environment

  • Version used (minio-operator): minio/operator:v6.0.4
  • Environment name and version (e.g. kubernetes v1.17.2): v1.29.9-gke
@allanrogerr (Contributor)

Add a readinessProbe to your tenant spec, e.g.:

  readiness:
    httpGet: 
      path: /minio/health/cluster
      port: 9000
      scheme: HTTP
    initialDelaySeconds: 5
    periodSeconds: 1

See both
https://min.io/docs/minio/linux/operations/monitoring/healthcheck-probe.html
https://min.io/docs/minio/kubernetes/upstream/reference/operator-crd.html#tenantspec

sample-tenant.txt
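
For placement, the probe goes directly under spec in the Tenant manifest. A minimal sketch, assuming the same tenant as in the report (the image tag is a placeholder):

  apiVersion: minio.min.io/v2
  kind: Tenant
  metadata:
    name: tenant
    namespace: minio
  spec:
    image: quay.io/minio/minio:<release-tag>
    readiness:
      httpGet:
        path: /minio/health/cluster   # 200 only while the cluster has quorum
        port: 9000
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 1

While the probe fails on a pod, Kubernetes removes that pod from the service endpoints, so requests are no longer routed to peers that cannot serve them.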

fpetkovski (Author) commented Dec 8, 2024

This makes sense, thanks.

I do wonder, though, why the operator does not configure the probe automatically. I would expect an operator to manage a well-defined deployment by default, and it would be a much better user experience if the readiness probe were not something a user has to worry about, since it is an internal detail of MinIO.

@allanrogerr (Contributor)

The operator does not configure your tenant readiness. This is an opt-in configuration in the tenant yaml, since when you consider your tenant to be actually ready depends on your deployment, e.g. on expected network timeouts or some other more complex logic.
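
For example, a tenant that should count as ready as soon as it can serve reads could probe the read-quorum endpoint instead. A sketch using the /minio/health/cluster/read endpoint from the healthcheck guide linked above; the timings are illustrative:

  readiness:
    httpGet:
      path: /minio/health/cluster/read   # 200 once the cluster can serve reads
      port: 9000
      scheme: HTTP
    initialDelaySeconds: 5
    periodSeconds: 1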

@cesnietor (Contributor)

Please feel free to reopen if the issue persists after trying the suggested solution.

@fpetkovski (Author)

Thanks for your help @allanrogerr
