Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix statefulset PreSync to work for certain cases of unhealthy etcd clusters upon druid upgrade #823

Merged
merged 3 commits into from
Aug 2, 2024

Conversation

shreyas-s-rao
Copy link
Contributor

@shreyas-s-rao shreyas-s-rao commented Jul 2, 2024

How to categorize this PR?

/area quality
/kind bug

What this PR does / why we need it:
This PR fixes statefulset PreSync behavior, to no longer wait for etcd pods to be updated with the latest sts spec, or to be ready. Instead, PreSync will now simply check whether the pods have the expected labels, and continue with the next steps. This fixes certain cases where an etcd cluster is unhealthy due to a wrong reconfiguration of the etcd configmap, causing one of the three etcd pods to fail, which causes the PreSync step to get stuck in the pod updated+ready check, and not allow druid to run Sync which reconciles and potentially fixes the issue by reverting the configmap to the correct state.

Which issue(s) this PR fixes:
Fixes #818

Special notes for your reviewer:
While working on this fix, @unmarshall and I found certain cases where PreSync cannot succeed. Consider the following case:

  1. Druid is running with old version (v0.22.0)
  2. 3-node etcd cluster is deployed and ready
  3. Etcd spec is updated with non-existent secret reference (TLS or backup store)
  4. Etcd is reconciled, causing the rollout of the statefulset pods to start, updating pod etcd-2 to go into error state since the mentioned secret does not exist.
    1. This unhealthy pod can be caused by any other reason as well, and not necessarily a misconfiguration. We simply use this as an example to show this case.
  5. Upgrade druid to v0.23.0 (or current master version)
  6. Reconcile the Etcd. This causes the PreSync to get stuck since the statefulset rollout is stuck since the pod never becomes ready.
  7. Druid marks the Etcd status.lastOperation as Operation: Reconcile, State: Failed.
  8. Now, the user sees this and figures out the Etcd spec misconfiguration issue, corrects the Etcd spec with correct configuration and reconciles the Etcd again. Druid will now succeed and the etcd cluster becomes healthy and also updated with new labels.
  9. Thus, misconfiguration or transient errors can be overcome by the current druid logic after this fix PR.
  10. But, if before the druid upgrade, there is a data corruption in the etcd cluster, causing a quorum loss, then the new druid PreSync will not be able to handle this, and this is a current limitation of the reconciliation logic today. Druid will still mark the Reconcile operation as Failed, and an operator will need to manually look into it and fix it by performing a recovery from quorum loss using this guide.

/invite @acumino

Release note:

Before upgrading druid to `v0.23.0+`, please ensure that druid is running with at least `v0.22.3+`. This is required to avoid any downtime during the upgrade of the etcds by the new druid version, as well as to ensure backward compatibility of your etcds, in case you wish to downgrade back to `v0.22.3+`.

@shreyas-s-rao shreyas-s-rao added this to the v0.23.0 milestone Jul 2, 2024
@shreyas-s-rao shreyas-s-rao requested a review from a team as a code owner July 2, 2024 14:32
@gardener-robot gardener-robot added the needs/review Needs review label Jul 2, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 2, 2024
@gardener-robot
Copy link

@shreyas-s-rao Command "/InviteCommand" failed with "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the gardener/etcd-druid repository.".

Additional Information
Redacted in public. Check backend logs.

@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Jul 2, 2024
@gardener-robot
Copy link

@shreyas-s-rao Command "/InviteCommand" failed with "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the gardener/etcd-druid repository.".

Additional Information
Redacted in public. Check backend logs.

@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 2, 2024
@seshachalam-yv
Copy link
Contributor

/assign

shreyas-s-rao added a commit to shreyas-s-rao/etcd-druid that referenced this pull request Jul 3, 2024
@ishan16696
Copy link
Member

/assign

api/v1alpha1/helper.go Outdated Show resolved Hide resolved
internal/component/statefulset/statefulset.go Outdated Show resolved Hide resolved
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jul 4, 2024
renormalize pushed a commit that referenced this pull request Jul 8, 2024
…which includes #777 (#804)

* Add new labels to sts pods, for backward compatibility with v0.23.0 which includes #777

* Fix unit tests for status.members checker

* Address review comment by @anveshreddy18; check exact match for sts label selector for sts recreation

* Statefulset PreDeploy now only checks pod labels, and not whether they are updated or ready, similar to #823

* Add comments for `PreDeploy` methods
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 9, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 9, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 9, 2024
@shreyas-s-rao
Copy link
Contributor Author

@seshachalam-yv thanks for your review. Your comments are addressed now. PTAL

@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 9, 2024
Copy link
Contributor

@seshachalam-yv seshachalam-yv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added the reviewed/lgtm Has approval for merging label Jul 10, 2024
@gardener-robot gardener-robot removed needs/changes Needs (more) changes needs/review Needs review labels Jul 10, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 10, 2024
@shreyas-s-rao shreyas-s-rao self-assigned this Jul 25, 2024
@ishan16696
Copy link
Member

PreSync behaviour to no longer wait for etcd pods to be updated with the latest sts spec, or to be ready

If it's isn't waiting for pod to be ready/running, isn't that can cause a transient quorum loss ?

@ishan16696
Copy link
Member

ishan16696 commented Jul 25, 2024

Etcd spec is updated with non-existent secret reference (TLS or backup store)

we usually never go from TLS to non-TLS, Is there any other case where we want to do this ?

@ishan16696
Copy link
Member

Had a offline discussion with @shreyas-s-rao and he informed me that he didn't see the transient quorum loss while testing a upgrading from v0.22.3 to v0.23.0, but he observed a transient quorum loss while upgrading v0.22.0 to v0.23.0.
So, we concluded that he will mentioned this in a release note to upgrade the druid to v0.23.0 only atleast from v0.22.3.

Copy link
Member

@ishan16696 ishan16696 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

@gardener-robot
Copy link

@shreyas-s-rao Command "/InviteCommand" failed with "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the gardener/etcd-druid repository.".

Additional Information
Redacted in public. Check backend logs.

@gardener-robot gardener-robot added needs/review Needs review and removed reviewed/lgtm Has approval for merging labels Aug 2, 2024
@shreyas-s-rao
Copy link
Contributor Author

shreyas-s-rao commented Aug 2, 2024

I've added a breaking release note informing users to upgrade to v0.22.3+ before upgrading to v0.23.0.

@gardener-robot
Copy link

@shreyas-s-rao Command "/InviteCommand" failed with "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the gardener/etcd-druid repository.".

Additional Information
Redacted in public. Check backend logs.

1 similar comment
@gardener-robot
Copy link

@shreyas-s-rao Command "/InviteCommand" failed with "Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the gardener/etcd-druid repository.".

Additional Information
Redacted in public. Check backend logs.

@shreyas-s-rao shreyas-s-rao merged commit fb44e67 into gardener:master Aug 2, 2024
11 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 2, 2024
@shreyas-s-rao shreyas-s-rao deleted the fix/presync branch August 2, 2024 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) size/s Size of pull request is small (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deleting the etcd-bootstrap configmap leads to etcd reconciliation to never succeed
8 participants