Regression: PR 2253 causes all-at-once tenant update #2332

Closed
mmulvanny opened this issue Oct 8, 2024 · 5 comments

@mmulvanny

Expected Behavior

Updating a tenant should cause pods to update in a rolling fashion, and the MinIO service should remain available at all times.

Current Behavior

Updating to 6.0.3 causes the operator to delete all tenant pods at once, resulting in a MinIO outage.

Possible Solution

Steps to Reproduce (for bugs)

  1. Deploy MinIO tenant with multiple pods managed by operator version 5.0.15
  2. Upgrade operator to 6.0.3

Context

We upgraded the MinIO operator from 5.0.15 to 6.0.3.

Regression

This was caused by the combination of PR 2221 (which moved environment configuration to a sidecar) and PR 2253 (which deleted pods on configuration changes). Was PR 2253 intended to remove rolling updates?

Your Environment

  • Version used (minio-operator): This occurred immediately after an upgrade to 6.0.3.
  • Environment name and version (e.g. kubernetes v1.17.2): Kubernetes 1.28.7
  • Server type and version:
  • Operating System and version: Ubuntu 20.04
  • Link to your deployment file:
@harshavardhana
Member

There is no such thing as rolling updates in our operator, @mmulvanny. We always perform in-place updates of the container binary, and the StatefulSet then rolls out the changes.

However, the cluster itself must be online well before this. Please share the operator logs so we can make the right assessment of what happened.

Thanks
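
For context, the rolling behaviour described here comes from Kubernetes itself: when only the StatefulSet spec changes, the default RollingUpdate strategy replaces pods one at a time, highest ordinal first. A minimal client-go sketch of that kind of update (placeholder namespace, StatefulSet name, and image tag; not the operator's actual code):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config, as an operator-style controller would use.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Placeholder namespace and StatefulSet name.
	sts, err := client.AppsV1().StatefulSets("minio-tenant").Get(ctx, "myminio-pool-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Changing only the pod template (here: the image, with a placeholder tag)
	// leaves the default RollingUpdate strategy in charge, so Kubernetes
	// replaces the pods sequentially from the highest ordinal down.
	sts.Spec.Template.Spec.Containers[0].Image = "minio/minio:RELEASE.PLACEHOLDER"

	if _, err := client.AppsV1().StatefulSets("minio-tenant").Update(ctx, sts, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("StatefulSet updated; Kubernetes rolls the pods one at a time")
}
```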

@cesnietor
Contributor

There are breaking changes when upgrading to version 6, so we have documentation for it. Please see:
https://min.io/docs/minio/kubernetes/eks/operations/install-deploy-manage/upgrade-minio-operator.html#upgrade-minio-operator-5-0-15-to-operator-version-stable

@cesnietor
Contributor

closing, please reopen if the docs don't help.

@mmulvanny
Author

We performed an upgrade of another instance today and ran into the same issue. We use Flux to manage our Helm releases, but our upgrade steps were equivalent to the Helm upgrade steps on the page @cesnietor linked. We upgraded the tenant's and the operator's Helm charts from 5.0.15 to 6.0.3 simultaneously.

Our operator log is here:

minio-operator-6.0.3-upgrade.log

@harshavardhana we always see exactly the behavior you described, where the controller updates the StatefulSet and the pods restart one by one. That started to happen in this case too, but then the operator deleted the pods and forced them to start up together.

I tried modifying the environment variables of the tenant StatefulSet in our test environment to see if I could get the 6.0.3 operator to delete pods and wasn't able to. Was there a particular condition that would invoke the code path of PR 2253 that I failed to reproduce by doing that?
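
A sketch of that kind of experiment with client-go, for reference; the namespace, StatefulSet name, and variable name are placeholders rather than the manifests actually used:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Local kubeconfig, as when poking at a test cluster from a workstation.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Placeholder namespace and StatefulSet name.
	sts, err := client.AppsV1().StatefulSets("minio-tenant").Get(ctx, "myminio-pool-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Add a MINIO_-prefixed variable to the minio container only.
	for i, c := range sts.Spec.Template.Spec.Containers {
		if c.Name == "minio" {
			sts.Spec.Template.Spec.Containers[i].Env = append(c.Env,
				corev1.EnvVar{Name: "MINIO_TEST_FLAG", Value: "on"})
		}
	}

	// Push the change and observe whether the operator reacts by deleting pods
	// or whether the StatefulSet simply performs its usual rolling restart.
	if _, err := client.AppsV1().StatefulSets("minio-tenant").Update(ctx, sts, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```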

@ramondeklein
Contributor

@mmulvanny Unfortunately, Kubernetes doesn't allow an STS to restart all pods at once (it can do this for a Deployment using the Recreate strategy). If there are updates to the STS, Kubernetes will initiate a rolling restart, so if the operator updates the STS with a new image, a rolling update will probably already be underway.

The operator will force all pods to terminate (and thus restart all at once) when it detects a change to any of the environment variables (starting with MINIO_) of the minio container in the StatefulSet.
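
A simplified sketch of the check described above (hypothetical helpers, not the operator's actual implementation): compare the MINIO_-prefixed variables of the minio container between the existing and desired StatefulSet and, on any difference, delete the pool's pods in one go.

```go
package sketch

import (
	"context"
	"reflect"
	"strings"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// minioEnv collects the MINIO_-prefixed variables of the minio container.
func minioEnv(sts *appsv1.StatefulSet) map[string]string {
	env := map[string]string{}
	for _, c := range sts.Spec.Template.Spec.Containers {
		if c.Name != "minio" {
			continue
		}
		for _, e := range c.Env {
			if strings.HasPrefix(e.Name, "MINIO_") {
				env[e.Name] = e.Value
			}
		}
	}
	return env
}

// restartPoolIfEnvChanged compares MINIO_ variables and, on any difference,
// deletes every pod selected by the StatefulSet's label selector at once,
// so they all come back together with the new configuration.
func restartPoolIfEnvChanged(ctx context.Context, client kubernetes.Interface, existing, desired *appsv1.StatefulSet) error {
	if reflect.DeepEqual(minioEnv(existing), minioEnv(desired)) {
		return nil // no MINIO_ changes: leave the normal rolling update alone
	}
	selector := metav1.FormatLabelSelector(existing.Spec.Selector)
	return client.CoreV1().Pods(existing.Namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```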
