OCPBUGS-46363: Remove narrow timeout from etcd bootstrap member removal gate. #9295
Conversation
Waiting for the etcd bootstrap member to be removed from the etcd cluster currently has its own five-minute timeout, in addition to the overall bootstrap-destroy timeout. This was apparently sufficient in most cases while the check applied only to single-node clusters, but it occasionally expires on HA clusters. Enforcing a separate timeout for this step alone is fragile: it is essential to guarantee that the bootstrap resources are not torn down before the etcd bootstrap member has been removed from the cluster, and the time spent waiting for that will fluctuate with how long the etcd operator takes to observe that it is safe to proceed without losing quorum.
Skipping CI for Draft Pull Request.
/test all
@benluddy: The following test failed, say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@benluddy: This pull request references Jira Issue OCPBUGS-46363, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@benluddy: This pull request references Jira Issue OCPBUGS-46363, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @tkashem
/lgtm
/assign @patrickdillon
/approve

Nice. Thank you!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: patrickdillon. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Merged commit 04e7a5d into openshift:master
@benluddy: Jira Issue OCPBUGS-46363: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-46363 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[ART PR BUILD NOTIFIER] Distgit: ose-installer-terraform-providers
[ART PR BUILD NOTIFIER] Distgit: ose-installer-altinfra
[ART PR BUILD NOTIFIER] Distgit: ose-baremetal-installer
[ART PR BUILD NOTIFIER] Distgit: ose-installer-artifacts